1 Introduction

In our final group assignment we are going to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in a city. We have chosen Amsterdam for our analysis.

Our AirBnB data come from insideairbnb.com; it was originally scraped from airbnb.com.

Below we load the data from Airbnb:

2 Exploratory Data Analysis (EDA)

Let’s first get an overview of our data

2.1 Data overview

We use glimpse and skim to gain a general view of the data. We have 74 variables and 16,116 observations. Variables include 24 character variables, 5 date variables, 8 logical variables, and 37 numeric variables. We also have some missing values throughout multiple of the variables in our data.

#Overview of Data
glimpse(listings)
Rows: 16,116
Columns: 74
$ id                                           <dbl> 2818, 20168, 25428, 27886…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.021091e+13, 2.021091e+1…
$ last_scraped                                 <date> 2021-09-07, 2021-09-07, …
$ name                                         <chr> "Quiet Garden View Room &…
$ description                                  <chr> "Quiet Garden View Room &…
$ neighborhood_overview                        <chr> "Indische Buurt (\"Indies…
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 3159, 59484, 56142, 97647…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Daniel", "Alexander", "J…
$ host_since                                   <date> 2008-09-24, 2009-12-02, …
$ host_location                                <chr> "Amsterdam, Noord-Holland…
$ host_about                                   <chr> "Upon arriving in Amsterd…
$ host_response_time                           <chr> "within an hour", "within…
$ host_response_rate                           <chr> "100%", "100%", "N/A", "8…
$ host_acceptance_rate                         <chr> "100%", "100%", "0%", "10…
$ host_is_superhost                            <lgl> TRUE, FALSE, TRUE, TRUE, …
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> "Indische Buurt", "Gracht…
$ host_listings_count                          <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_total_listings_count                    <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_verifications                           <chr> "['email', 'phone', 'revi…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> TRUE, TRUE, FALSE, TRUE, …
$ neighbourhood                                <chr> "Amsterdam, North Holland…
$ neighbourhood_cleansed                       <chr> "Oostelijk Havengebied - …
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude                                     <dbl> 52.36435, 52.36407, 52.37…
$ longitude                                    <dbl> 4.94358, 4.89393, 4.88487…
$ property_type                                <chr> "Private room in rental u…
$ room_type                                    <chr> "Private room", "Private …
$ accommodates                                 <dbl> 2, 2, 3, 2, 2, 1, 2, 3, 2…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1.5 shared baths", "1 pr…
$ bedrooms                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ beds                                         <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2…
$ amenities                                    <chr> "[\"Single level home\", …
$ price                                        <chr> "$59.00", "$106.00", "$12…
$ minimum_nights                               <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ maximum_nights                               <dbl> 28, 365, 120, 730, 1825, …
$ minimum_minimum_nights                       <dbl> 3, 1, 7, 2, 2, 2, 90, 4, …
$ maximum_minimum_nights                       <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ minimum_maximum_nights                       <dbl> 1125, 1125, 120, 1125, 18…
$ maximum_maximum_nights                       <dbl> 1125, 1125, 120, 1125, 18…
$ minimum_nights_avg_ntm                       <dbl> 3.0, 1.0, 13.8, 2.0, 2.0,…
$ maximum_nights_avg_ntm                       <dbl> 1125, 1125, 120, 1125, 18…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 3, 0, 1, 9, 11, 8, 4, 0, …
$ availability_60                              <dbl> 28, 0, 1, 20, 27, 23, 34,…
$ availability_90                              <dbl> 55, 0, 3, 47, 50, 50, 64,…
$ availability_365                             <dbl> 124, 0, 57, 66, 298, 313,…
$ calendar_last_scraped                        <date> 2021-09-07, 2021-09-07, …
$ number_of_reviews                            <dbl> 280, 339, 5, 223, 353, 49…
$ number_of_reviews_ltm                        <dbl> 2, 0, 0, 4, 19, 17, 0, 0,…
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 2, 8, 10, 0, 0, …
$ first_review                                 <date> 2013-08-25, 2014-01-17, …
$ last_review                                  <date> 2019-11-21, 2020-03-27, …
$ review_scores_rating                         <dbl> 4.89, 4.44, 5.00, 4.95, 4…
$ review_scores_accuracy                       <dbl> 4.93, 4.69, 5.00, 4.93, 4…
$ review_scores_cleanliness                    <dbl> 5.00, 4.79, 5.00, 4.96, 4…
$ review_scores_checkin                        <dbl> 4.97, 4.63, 5.00, 4.95, 4…
$ review_scores_communication                  <dbl> 4.97, 4.62, 5.00, 4.92, 4…
$ review_scores_location                       <dbl> 4.68, 4.87, 5.00, 4.90, 4…
$ review_scores_value                          <dbl> 4.81, 4.49, 4.80, 4.80, 4…
$ license                                      <chr> "0363 5F3A 5684 6750 D14D…
$ instant_bookable                             <lgl> TRUE, TRUE, FALSE, TRUE, …
$ calculated_host_listings_count               <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2…
$ calculated_host_listings_count_entire_homes  <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 2…
$ calculated_host_listings_count_private_rooms <dbl> 1, 2, 0, 1, 2, 2, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 2.86, 3.64, 0.11, 2.14, 4…
skim(listings)
Data summary
Name listings
Number of rows 16116
Number of columns 74
_______________________
Column type frequency:
character 24
Date 5
logical 8
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 16116 0
name 30 1.00 1 240 0 15766 0
description 223 0.99 1 1000 0 15711 0
neighborhood_overview 5711 0.65 1 1000 0 9495 0
picture_url 0 1.00 61 126 0 15968 0
host_url 0 1.00 38 43 0 14197 0
host_name 5 1.00 1 33 0 5166 0
host_location 35 1.00 2 86 0 560 0
host_about 7066 0.56 1 8920 0 7522 16
host_response_time 5 1.00 3 18 0 5 0
host_response_rate 5 1.00 2 4 0 57 0
host_acceptance_rate 5 1.00 2 4 0 91 0
host_thumbnail_url 5 1.00 55 106 0 14161 0
host_picture_url 5 1.00 57 109 0 14161 0
host_neighbourhood 5713 0.65 4 35 0 68 0
host_verifications 0 1.00 2 158 0 361 0
neighbourhood 5711 0.65 11 59 0 67 0
neighbourhood_cleansed 0 1.00 4 38 0 22 0
property_type 0 1.00 3 35 0 67 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 21 1.00 6 17 0 26 0
amenities 0 1.00 2 1488 0 14667 0
price 0 1.00 5 9 0 498 0
license 11561 0.28 6 63 0 3624 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
host_since 5 1.00 2008-09-24 2021-08-31 2015-04-07 3235
calendar_last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
first_review 2087 0.87 2011-04-19 2021-09-07 2017-10-30 2467
last_review 2087 0.87 2011-06-12 2021-09-07 2019-06-10 2019

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 5 1 0.12 FAL: 14103, TRU: 2008
host_has_profile_pic 5 1 1.00 TRU: 16078, FAL: 33
host_identity_verified 5 1 0.67 TRU: 10720, FAL: 5391
neighbourhood_group_cleansed 16116 0 NaN :
bathrooms 16116 0 NaN :
calendar_updated 16116 0 NaN :
has_availability 0 1 0.96 TRU: 15457, FAL: 659
instant_bookable 0 1 0.23 FAL: 12408, TRU: 3708

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.118184e+07 13520629.26 2.818000e+03 1.018020e+07 1.926593e+07 3.107548e+07 5.208280e+07 ▇▇▆▅▂
scrape_id 0 1.00 2.021091e+13 0.00 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 ▁▁▇▁▁
host_id 0 1.00 6.976052e+07 90271635.64 3.159000e+03 9.735558e+06 2.974134e+07 8.988321e+07 4.210037e+08 ▇▂▁▁▁
host_listings_count 5 1.00 2.020000e+00 23.34 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
host_total_listings_count 5 1.00 2.020000e+00 23.34 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
latitude 0 1.00 5.237000e+01 0.02 5.229000e+01 5.236000e+01 5.236000e+01 5.238000e+01 5.243000e+01 ▁▁▇▅▁
longitude 0 1.00 4.890000e+00 0.04 4.760000e+00 4.860000e+00 4.890000e+00 4.910000e+00 5.070000e+00 ▁▆▇▁▁
accommodates 0 1.00 2.840000e+00 1.31 0.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▃▁▁▁
bedrooms 898 0.94 1.530000e+00 0.95 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 5.000000e+01 ▇▁▁▁▁
beds 97 0.99 1.760000e+00 1.47 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 3.300000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 3.990000e+00 20.99 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.081500e+02 540.67 1.000000e+00 2.100000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▇▁▁▇▁
minimum_minimum_nights 3 1.00 3.950000e+00 20.99 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_minimum_nights 3 1.00 4.150000e+00 21.03 1.000000e+00 2.000000e+00 3.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
minimum_maximum_nights 3 1.00 6.855200e+02 532.48 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▆▁▁▇▁
maximum_maximum_nights 3 1.00 2.672465e+05 23924509.45 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 3 1.00 4.030000e+00 21.00 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 3 1.00 2.666400e+05 23870381.65 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.142625e+09 ▇▁▁▁▁
availability_30 0 1.00 4.030000e+00 8.70 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+01 ▇▁▁▁▁
availability_60 0 1.00 8.840000e+00 18.14 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 6.000000e+01 ▇▁▁▁▁
availability_90 0 1.00 1.432000e+01 28.51 0.000000e+00 0.000000e+00 0.000000e+00 5.000000e+00 9.000000e+01 ▇▁▁▁▁
availability_365 0 1.00 5.532000e+01 107.91 0.000000e+00 0.000000e+00 0.000000e+00 4.700000e+01 3.650000e+02 ▇▁▁▁▁
number_of_reviews 0 1.00 2.465000e+01 56.71 0.000000e+00 2.000000e+00 8.000000e+00 2.200000e+01 8.770000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 1.390000e+00 7.48 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.220000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 3.300000e-01 2.03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.370000e+02 ▇▁▁▁▁
review_scores_rating 2087 0.87 4.690000e+00 0.67 0.000000e+00 4.670000e+00 4.860000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 2301 0.86 4.810000e+00 0.35 0.000000e+00 4.750000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 2300 0.86 4.700000e+00 0.44 0.000000e+00 4.600000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 2309 0.86 4.850000e+00 0.31 0.000000e+00 4.820000e+00 4.950000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 2304 0.86 4.870000e+00 0.30 0.000000e+00 4.850000e+00 4.980000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 2309 0.86 4.730000e+00 0.33 0.000000e+00 4.600000e+00 4.810000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 2309 0.86 4.600000e+00 0.39 0.000000e+00 4.500000e+00 4.670000e+00 4.830000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 1.640000e+00 2.40 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.100000e+00 1.87 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 4.900000e-01 1.53 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.13 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00 ▇▁▁▁▁
reviews_per_month 2087 0.87 6.800000e-01 1.72 1.000000e-02 1.100000e-01 2.700000e-01 6.200000e-01 1.078400e+02 ▇▁▁▁▁

First out of interest we would like to plot a map of the raw dataset.

By plotting a cluster map that shows the number of flat listed for rent, we want to visualise the concentration of listings in the area of Amsterdam

#Map the concentration of listing in Amsterdam
listings %>% 
  leaflet() %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1,
                   fillOpacity = 0.3, 
                   popup = ~listing_url,
                   label = ~property_type,
                   clusterOptions = markerClusterOptions()
                   )

From this map, we concluded that more properties are available in the central area of Amsterdam. We would like to further investigate the relationship between the price of listings and the locations of properties.

We select some variables for our analysis which we believe would be relevant to determine the impact on prices.

We will hypothesize that the following are key variables which will will be using later on

  • price: cost per night (variable we will be use to predict)

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 100)

  • neighbourhood*: three variables on a few major neighbourhoods in each city

  • accomodations: How many people can stay in the room per night

  • beds: Number of beds

  • host_is_superhost: Whether the host has superhost status

  • minimum_nights: Minimum nights to stay in the Airbnb

  • bedrooms: Number of bedrooms

  • reviews_per_month: Number of reviews per month

  • availability_30: Available stays next 30 days

  • license: license number

  • instant_bookable: Whether the room can be instantly booked

If you are interest in further information is is available here

Let’s first filter for our variables of interest

#Select the variables for further exploration

listings2 <- listings %>% 
  
  select(c("host_since", 
           "host_is_superhost", 
           "neighbourhood_cleansed", 
           "property_type", 
           "room_type", 
           "accommodates",
           "price",
           "review_scores_rating", 
           "number_of_reviews",
           "minimum_nights", 
           "reviews_per_month",
           "bedrooms","beds",
           "availability_30",
           "last_review",
           "license",
           "instant_bookable"))

# We want to choose the "bathrooms" as well, however,this variable has no values

2.2 Data wrangling

Now let’s ensure we can work with our data.

First lets briefly look at our new dataset

skim(listings2)
Data summary
Name listings2
Number of rows 16116
Number of columns 17
_______________________
Column type frequency:
character 5
Date 2
logical 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1.00 4 38 0 22 0
property_type 0 1.00 3 35 0 67 0
room_type 0 1.00 10 15 0 4 0
price 0 1.00 5 9 0 498 0
license 11561 0.28 6 63 0 3624 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 5 1.00 2008-09-24 2021-08-31 2015-04-07 3235
last_review 2087 0.87 2011-06-12 2021-09-07 2019-06-10 2019

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 5 1 0.12 FAL: 14103, TRU: 2008
instant_bookable 0 1 0.23 FAL: 12408, TRU: 3708

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1.00 2.84 1.31 0.00 2.00 2.00 4.00 16.00 ▇▃▁▁▁
review_scores_rating 2087 0.87 4.69 0.67 0.00 4.67 4.86 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1.00 24.65 56.71 0.00 2.00 8.00 22.00 877.00 ▇▁▁▁▁
minimum_nights 0 1.00 3.99 20.99 1.00 2.00 2.00 3.00 1100.00 ▇▁▁▁▁
reviews_per_month 2087 0.87 0.68 1.72 0.01 0.11 0.27 0.62 107.84 ▇▁▁▁▁
bedrooms 898 0.94 1.53 0.95 1.00 1.00 1.00 2.00 50.00 ▇▁▁▁▁
beds 97 0.99 1.76 1.47 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1.00 4.03 8.70 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

We see some issues that we will try and resolve.

2.2.1 Price

First:

Price is sometimes given as a character string and we would like it to be a double variable.

Below we change the price to a double, i.e. a numeric variable.

#Change the 'price' to numeric
listings2 <- listings2 %>% 
  mutate(price = parse_number(price))

#Use `typeof(listing2$price)` to confirm that `price` is now stored as a number.
typeof(listings2$price)
[1] "double"

2.2.2 Licenses

We then want to impute the missing value of “license” with logical judgement instead of a license number.

This will make our analysis simply focus on whether people have a license or not.

#Impute the missing value of “license”
listings2<-listings2 %>% 
  
  #Add yes or no depending on whether variable is missing
  mutate(license = ifelse(is.na(license), "no","yes"))


#Let's check our data by filtering for our license
License_check <- listings2 %>% 
  select(license)


#Let's take a glimpse at the results.
glimpse(License_check)
Rows: 16,116
Columns: 1
$ license <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "y…

The code works as hoped.

2.2.3 Property types

Next, we look at the variable property_type. We use the count function to determine how many categories there are and their frequency.

This will be useful later to summarize our property types in larger categories as there will be more observations and less variables in the regression.

#Determine how many categories there are and their frequency
count_prop <- listings2 %>% 
  
  #Select variable of interest - i.e. property types
  select(property_type) %>% 
  
  #Count Number of each property type
  count(property_type) %>% 
  
  #Arrange by most frequent category
  arrange(desc(n)) %>% 
  
  #Create a new column for proportions
  mutate(percentage=n/sum(n)*100) #Determine the proportion of the total listings the top 4 types make up 

#Print Table
count_prop
property_typenpercentage
Entire rental unit977660.7    
Private room in rental unit185011.5    
Entire residential home10456.48   
Entire townhouse4292.66   
Entire condominium (condo)3542.2    
Private room in bed and breakfast3161.96   
Private room in residential home2961.84   
Entire loft2691.67   
Houseboat1861.15   
Boat1811.12   
Room in boutique hotel1741.08   
Private room in townhouse1560.968  
Entire serviced apartment1270.788  
Private room in houseboat1120.695  
Private room in guest suite1100.683  
Room in hotel1070.664  
Private room in condominium (condo)970.602  
Private room in boat940.583  
Private room in loft530.329  
Room in bed and breakfast360.223  
Entire guest suite340.211  
Entire villa300.186  
Shared room in rental unit240.149  
Private room in hostel230.143  
Room in aparthotel210.13   
Entire guesthouse200.124  
Room in hostel200.124  
Room in serviced apartment160.0993 
Private room150.0931 
Private room in guesthouse140.0869 
Entire place110.0683 
Private room in serviced apartment100.0621 
Entire cottage90.0558 
Shared room in hostel80.0496 
Tiny house80.0496 
Private room in farm stay70.0434 
Private room in tiny house60.0372 
Private room in villa60.0372 
Entire cabin50.031  
Barn40.0248 
Entire bed and breakfast40.0248 
Entire bungalow40.0248 
Private room in casa particular40.0248 
Camper/RV30.0186 
Entire home/apt30.0186 
Floor30.0186 
Private room in cabin30.0186 
Shared room in houseboat30.0186 
Shared room in residential home30.0186 
Campsite20.0124 
Entire chalet20.0124 
Private room in dome house20.0124 
Private room in earth house20.0124 
Private room in floor20.0124 
Private room in island20.0124 
Shared room in bed and breakfast20.0124 
Tower20.0124 
Yurt20.0124 
Bus10.00621
Cave10.00621
Earth house10.00621
Private room in bungalow10.00621
Private room in nature lodge10.00621
Room in casa particular10.00621
Shared room in boat10.00621
Shared room in loft10.00621
Tipi10.00621

The top 4 common property types are “Entire rental unit”,“Private room in rental unit”,“Entire residential home” and “Entire townhouse”. The top 4 types make up 60.66%, 11.48%, 6.48%, and 2.66% of the total listings

This will also make it easier for our regressions at a later stage given this is categorical variables.

#Simplify the "property_type" 
listings2 <- listings2 %>%
  
  #Create new names called other for properties not in the top four categories
  mutate(prop_type_simplified = case_when(
    
    #Specifying names to be kept
    property_type %in% c("Entire rental unit",
                         "Private room in rental unit", 
                         "Entire residential home", 
                         "Entire townhouse") ~ property_type, 
    
    #Specifying ot leave remaining in other
    TRUE ~ "Other"))

#We check the `prop_type_simplified` was correctly made
listings2 %>%
  
  #Count each category
  count(property_type, prop_type_simplified) %>%
  
  #Arrange by largest category
  arrange(desc(n))    
property_typeprop_type_simplifiedn
Entire rental unitEntire rental unit9776
Private room in rental unitPrivate room in rental unit1850
Entire residential homeEntire residential home1045
Entire townhouseEntire townhouse429
Entire condominium (condo)Other354
Private room in bed and breakfastOther316
Private room in residential homeOther296
Entire loftOther269
HouseboatOther186
BoatOther181
Room in boutique hotelOther174
Private room in townhouseOther156
Entire serviced apartmentOther127
Private room in houseboatOther112
Private room in guest suiteOther110
Room in hotelOther107
Private room in condominium (condo)Other97
Private room in boatOther94
Private room in loftOther53
Room in bed and breakfastOther36
Entire guest suiteOther34
Entire villaOther30
Shared room in rental unitOther24
Private room in hostelOther23
Room in aparthotelOther21
Entire guesthouseOther20
Room in hostelOther20
Room in serviced apartmentOther16
Private roomOther15
Private room in guesthouseOther14
Entire placeOther11
Private room in serviced apartmentOther10
Entire cottageOther9
Shared room in hostelOther8
Tiny houseOther8
Private room in farm stayOther7
Private room in tiny houseOther6
Private room in villaOther6
Entire cabinOther5
BarnOther4
Entire bed and breakfastOther4
Entire bungalowOther4
Private room in casa particularOther4
Camper/RVOther3
Entire home/aptOther3
FloorOther3
Private room in cabinOther3
Shared room in houseboatOther3
Shared room in residential homeOther3
CampsiteOther2
Entire chaletOther2
Private room in dome houseOther2
Private room in earth houseOther2
Private room in floorOther2
Private room in islandOther2
Shared room in bed and breakfastOther2
TowerOther2
YurtOther2
BusOther1
CaveOther1
Earth houseOther1
Private room in bungalowOther1
Private room in nature lodgeOther1
Room in casa particularOther1
Shared room in boatOther1
Shared room in loftOther1
TipiOther1
#Removing the old property type
listings2 <- subset(listings2, select = -property_type)

Our table comes out as expected and we can see that all smaller property types are in the other category.

#Using above dataset for the confidence interval calculations
formula_ci <- listings2%>% 
  group_by(prop_type_simplified) %>% 
  
  #Calculate weight's summary statistics for people exercising at least 3 times a week 
  
  # calculate mean, SD, count, SE, lower/upper 95% CI
  summarise(
    average_price=mean(price,na.rm=TRUE), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
            
    sd_price=sd(price,na.rm=TRUE), #Standard Deviation
            
    count= n(), #Observations
           
    t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
            
    se_price=sd_price/sqrt(count), #Standard Error 
           
    margin_of_error= t_critical*se_price, #Margin of Error
            
    price_low= average_price - margin_of_error, #Lower interval
            
    price_high= average_price + margin_of_error) #Upper Interval 

formula_ci
prop_type_simplifiedaverage_pricesd_pricecountt_criticalse_pricemargin_of_errorprice_lowprice_high
Entire rental unit160  16697761.961.683.29157  163  
Entire residential home208  16910451.965.2310.3 198  219  
Entire townhouse218  1344291.976.4712.7 206  231  
Other160  20830161.963.797.44152  167  
Private room in rental unit93.113618501.963.176.2286.899.3
ggplot(formula_ci, aes(x=average_price, y=prop_type_simplified, color=prop_type_simplified)) +

#geom_errorbar function allows us to show the two bars with confidence intervals

geom_errorbar(aes(xmin=price_low, xmax=price_high),width = 0.1, size=0.5)+ 
  
geom_point(aes(x=average_price),size=1)+

  
theme_bw() +
  
theme(legend.position = "none",axis.title.y=element_blank())+
  
  labs(title = "Confidence interval for average price per property type",
       subtitle = "95% confidence intervals overlap",
       x = "Average Price"
       ) +
  NULL

The property type seems to have a significant effect on prices. The length of the intervals mainly vary due to the differences in sample size. The most common property type (Entire rental unit) has a small interval because it has the larger sample. We can estimate quite aaccuratly the average price in this category.

2.2.4 Neighbourhood

Next, we look at the variable neighbourhood_cleansed. We use the count function to determine how many categories there are and their frequency. The top 5 common neighbourhoods are “De Baarsjes - Oud-West”,“De Pijp - Rivierenbuurt”,“Centrum-West”, “Centrum-Oost” and “WesterPark”. The top 5 types make up 16.8%, 12.3%, 10.8%, 8.51%%, and 7.48% of the total listings

#Determine how many categories there are and their frequency
count_prop <- listings2 %>% 
  
  #Selecting variable of interest
  select(neighbourhood_cleansed) %>% 
  
  #Counting observations 
  count(neighbourhood_cleansed) %>% 
  
  #Arrange by observations
  arrange(desc(n)) %>% 
  
  #In Percentage
  mutate(percentage=n/sum(n)*100) #Determine the proportion of the total listings the top 4 types make up 

#Print Table
count_prop
neighbourhood_cleansednpercentage
De Baarsjes - Oud-West270116.8  
De Pijp - Rivierenbuurt198612.3  
Centrum-West174610.8  
Centrum-Oost13728.51 
Westerpark12057.48 
Zuid11967.42 
Oud-Oost10326.4  
Bos en Lommer9315.78 
Oostelijk Havengebied - Indische Buurt7334.55 
Oud-Noord5183.21 
Watergraafsmeer4762.95 
IJburg - Zeeburgereiland3962.46 
Slotervaart3492.17 
Noord-West3232    
Noord-Oost2271.41 
Buitenveldert - Zuidas2171.35 
Geuzenveld - Slotermeer2001.24 
De Aker - Nieuw Sloten1160.72 
Osdorp1140.707
Gaasperdam - Driemond1070.664
Bijlmer-Centrum910.565
Bijlmer-Oost800.496

Looking into this we see that there are many more high observation neighbourhoods compared to property types.

Therefore, we believe it makes more sense to split the neighbourhood by their relative prices.

We do this below

#Setting neighbourhood by price
neighbourhood_group <- listings2 %>% 
  
  #Variable of interest
  group_by(neighbourhood_cleansed) %>% 
  
  #Getting median price
  summarize(price_median = median(price)) %>% 
  
  #Arranging by highest price
  arrange(desc(price_median))

#Print
neighbourhood_group
neighbourhood_cleansedprice_median
Centrum-Oost150
Centrum-West150
IJburg - Zeeburgereiland146
De Pijp - Rivierenbuurt140
Zuid134
De Baarsjes - Oud-West129
Oud-Oost129
Westerpark129
Oud-Noord125
Watergraafsmeer125
Noord-West120
Noord-Oost119
Oostelijk Havengebied - Indische Buurt114
Buitenveldert - Zuidas112
Bos en Lommer110
Geuzenveld - Slotermeer100
Slotervaart100
De Aker - Nieuw Sloten98
Osdorp91
Gaasperdam - Driemond85
Bijlmer-Centrum83
Bijlmer-Oost79

We see that some central neighbourhoods are more expensive, which logically makes sense given this is a large city and that is where tourists and other activities are.

Let’s categorize neighbourhood by price

#Simplify the "property_type" 
listings2 <- listings2 %>%
  
  #Adding new variable names
  mutate(neighbourhood_cleansed = case_when(
                                  neighbourhood_cleansed == "Centrum-Oost" ~ "Top 5",
                                  neighbourhood_cleansed == "Centrum-West" ~ "Top 5",
                                  neighbourhood_cleansed == "IJburg - Zeeburgereiland" ~ "Top 5",
                                  neighbourhood_cleansed == "Zuid" ~ "Top 5",
                                  neighbourhood_cleansed == "De Pijp - Rivierenbuurt" ~ "Top 5",
                                  neighbourhood_cleansed == "Oud-Noord" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Watergraafsmeer" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Oud-Oost" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Westerpark" ~ "Top 6-10",
                                  neighbourhood_cleansed == "De Baarsjes - Oud-West" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Noord-West" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Noord-Oost" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Oostelijk Havengebied - Indische Buurt" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Buitenveldert - Zuidas" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Bos en Lommer" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Geuzenveld - Slotermeer" ~ "Remaining",
                                  neighbourhood_cleansed == "Slotervaart" ~ "Remaining",
                                  neighbourhood_cleansed == "De Aker - Nieuw Sloten" ~ "Remaining",
                                  neighbourhood_cleansed == "Osdorp" ~ "Remaining",
                                  neighbourhood_cleansed == "Gaasperdam - Driemond" ~ "Remaining",
                                  neighbourhood_cleansed == "Bijlmer-Centrum" ~ "Remaining",
                                  neighbourhood_cleansed == "Bijlmer-Oost" ~ "Remaining"))
                                  
                                
#Check the `prop_type_simplified` was correctly made
listings2 %>%
  
  #Counting by number of observations
  count(neighbourhood_cleansed) %>%
  
  #Displaying in descending order
  arrange(desc(n))    
neighbourhood_cleansedn
Top 56696
Top 6-105932
Top 11-152431
Remaining1057

2.2.5 Minimum nights

Let’s first look into how minimum nights look at Airbnb Amsterdam.

#Find out the most common values for the variable `minimum_nights`
listings2 %>%
  
  #Count Observations
  count(minimum_nights) %>%
  
  #Arrange by observations
  arrange(desc(n))
minimum_nightsn
2      5993
3      3918
1      2658
4      1367
5      895
7      481
6      234
14      109
10      92
30      63
8      33
21      32
20      30
25      19
12      18
15      18
28      16
60      15
9      12
90      11
13      7
59      7
180      7
11      4
19      4
23      4
29      4
50      4
58      4
100      4
16      3
27      3
31      3
200      3
300      3
365      3
18      2
26      2
42      2
45      2
99      2
150      2
1e+03      2
24      1
33      1
37      1
38      1
63      1
70      1
75      1
80      1
89      1
94      1
95      1
120      1
181      1
183      1
186      1
222      1
240      1
500      1
999      1
1e+03      1
1.1e+031

By looking into the vairable minimum_nights, we found out that airbnb most commonly required people to stay for at least 2 nights. The majority of users tend to spend less than a week, which indicates that Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels..

There are also some variables that stand out i.e. minimum nights of 1,100 days (3 years minimum rent). These individuals are probably looking for long term renters.

We only want to include listings in our regression analysis that are intended for travel purposes, i.e. no more than 4 days.

#Filter the Airbnb data so that it only includes observations with `minimum_nights <= 4`
listings2 <- listings2 %>% 
  filter(minimum_nights <= 4)

2.2.6 Dropping NA’s

We then skim the dataset to see if further data wrangling is required.

skim(listings2)
Data summary
Name listings2
Number of rows 13936
Number of columns 17
_______________________
Column type frequency:
character 4
Date 2
logical 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1 5 9 0 4 0
room_type 0 1 10 15 0 4 0
license 0 1 2 3 0 2 0
prop_type_simplified 0 1 5 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 5 1.00 2008-09-24 2021-08-31 2015-05-04 3121
last_review 1602 0.89 2012-05-12 2021-09-07 2019-06-16 1920

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 5 1 0.13 FAL: 12118, TRU: 1813
instant_bookable 0 1 0.24 FAL: 10656, TRU: 3280

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1.00 2.82 1.31 0.00 2.00 2.00 4.00 16.00 ▇▃▁▁▁
price 0 1.00 155.55 164.51 0.00 96.00 130.00 180.00 8000.00 ▇▁▁▁▁
review_scores_rating 1602 0.89 4.69 0.65 0.00 4.67 4.85 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1.00 26.39 59.57 0.00 2.00 8.00 23.00 877.00 ▇▁▁▁▁
minimum_nights 0 1.00 2.29 0.88 1.00 2.00 2.00 3.00 4.00 ▃▇▁▅▂
reviews_per_month 1602 0.89 0.72 1.78 0.01 0.11 0.28 0.65 107.84 ▇▁▁▁▁
bedrooms 814 0.94 1.50 0.85 1.00 1.00 1.00 2.00 15.00 ▇▁▁▁▁
beds 85 0.99 1.74 1.49 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1.00 4.07 8.71 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

As we can see, review_scores_rating, reviews_per_month, host_is_superhost, bedrooms and beds still have missing values. We will drop unreviewed units for now and as these units will likely be less accurate in terms of price / review ratings. Similarly for bedrooms and beds, we drop missing values.

#Dropping values

listings2 <-  listings2 %>% 
  
  #Choosing which variables to drop NAs from
  drop_na(bedrooms,beds,host_is_superhost,review_scores_rating,reviews_per_month)

#Skimming to confirm code works as intended. 
skim(listings2)
Data summary
Name listings2
Number of rows 11586
Number of columns 17
_______________________
Column type frequency:
character 4
Date 2
logical 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1 5 9 0 4 0
room_type 0 1 10 15 0 4 0
license 0 1 2 3 0 2 0
prop_type_simplified 0 1 5 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1 2008-09-24 2021-08-11 2015-03-04 2921
last_review 0 1 2012-05-12 2021-09-07 2019-06-14 1891

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.14 FAL: 9945, TRU: 1641
instant_bookable 0 1 0.22 FAL: 9018, TRU: 2568

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1 2.85 1.30 1.00 2.00 2.00 4.00 16.00 ▇▁▁▁▁
price 0 1 154.60 143.04 4.00 99.00 130.00 180.00 8000.00 ▇▁▁▁▁
review_scores_rating 0 1 4.70 0.65 0.00 4.67 4.86 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1 29.16 61.35 1.00 4.00 10.00 26.00 866.00 ▇▁▁▁▁
minimum_nights 0 1 2.32 0.87 1.00 2.00 2.00 3.00 4.00 ▃▇▁▅▂
reviews_per_month 0 1 0.69 1.79 0.01 0.11 0.28 0.62 107.84 ▇▁▁▁▁
bedrooms 0 1 1.50 0.85 1.00 1.00 1.00 2.00 15.00 ▇▁▁▁▁
beds 0 1 1.76 1.51 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1 3.89 8.47 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

2.2.7 Dropping Rating 0

We also know that ratings on Airbnb are 1-5, therefore there must be an issue with reviews that are 0 out of 5.

Let’s remove these from our data

#Dropping reviews at 0
listings2 <- listings2 %>% 
  filter(review_scores_rating > 0)

2.2.8 Final skim of data

Let’s take a final skim of the data

skim(listings2)
Data summary
Name listings2
Number of rows 11426
Number of columns 17
_______________________
Column type frequency:
character 4
Date 2
logical 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1 5 9 0 4 0
room_type 0 1 10 15 0 4 0
license 0 1 2 3 0 2 0
prop_type_simplified 0 1 5 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1 2008-09-24 2021-08-11 2015-03-01 2911
last_review 0 1 2012-05-12 2021-09-07 2019-06-16 1877

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.14 FAL: 9785, TRU: 1641
instant_bookable 0 1 0.22 FAL: 8889, TRU: 2537

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1 2.85 1.30 1.00 2.00 2.00 4.00 16.00 ▇▁▁▁▁
price 0 1 153.68 122.98 4.00 99.00 130.00 180.00 7999.00 ▇▁▁▁▁
review_scores_rating 0 1 4.77 0.33 1.00 4.67 4.86 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1 29.55 61.69 1.00 4.00 11.00 26.00 866.00 ▇▁▁▁▁
minimum_nights 0 1 2.32 0.87 1.00 2.00 2.00 3.00 4.00 ▃▇▁▅▂
reviews_per_month 0 1 0.70 1.80 0.01 0.12 0.28 0.63 107.84 ▇▁▁▁▁
bedrooms 0 1 1.50 0.85 1.00 1.00 1.00 2.00 15.00 ▇▁▁▁▁
beds 0 1 1.76 1.51 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1 3.89 8.47 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

We now have wrangled our data sufficiently and have: - 4 character variables ready for analysis with not too many categories. - 2 Date variables with no missing values - 2 Logical variables with no missing variables - 9 range of numeric variables with no missing values.

We also have 11.4 thousand observations.

We will now explore how these variables affect price.

2.3 Visualisations

By visualising the relationship between price and the variables that we believe will affect the price substantially, we hope to identify a noticeable correlation between price and each of those variables and use statistical modelling to further investigate their relationships and provide possible explanation.

2.3.1 Overview of Price

We first take a look at the price.

#Overview of price
favstats(listings2$price)
minQ1medianQ3maxmeansdnmissing
4991301808e+03154123114260

We see that price ranges from 4 to 8,000 per night with a mean of 154 and median of 130 (i.e. some left skew)

We create the density plot of price.

#Create a density plot of prices
ggplot(listings2,aes(x=price))+
  
  #Using a density plot
  geom_density()+
  
  #Adding useful titles
  labs(title="Density plot of listing Airbnb prices",
       x="Price",
       y="Density") + 
  
  #Minimal theme
  theme_minimal()

#Create a density plot of log of prices
ggplot(listings2,aes(x=log(price)))+
  
  #Density plot
  geom_density()+
  
  #Useful Labes
  labs(title="Density plot of listing Airbnb (log) prices",
       x="Log Price",
       y="Density") + 
  
  #Minimal Theme
  theme_minimal()

From here forward we will use Log which is more normally distributed. We can see that there are some peaks and lows on the graph which we connect to well-known disposition of people to round numbers, i.e. setting price to be USD 140 instead of USD 137 or USD 142. We assume that the graph would be smooth and almost perfectly normally distributed without this feature

2.3.2 Location/ Neighbourhood

First, we skimmed the neighbourhood_cleansed data to analyse the distribution of price of listings depends on neighbourhood. From the table below, we found out that median price will better suits the purpose of our research than the mean price because the distribution of price of each neighbourhood is greaatly skewed with extreme outliers. For example all neighbourhoods have lower median than mean price while the most expensive property costs USD8,000, causing a large standard deviation. Hence, standard deviation is unusually large among neighbourhood.

#Analyse the distribution of price of listings depends on neighbourhood
favstats(price~neighbourhood_cleansed,data=listings2) %>% 
  
  #Arrange by count
  arrange(desc(n)) 
neighbourhood_cleansedminQ1medianQ3maxmeansdnmissing
Top 541041492002e+03172113  46470
Top 6-1010991301758e+03151149  43490
Top 11-15258911815080012866.517050
Remaining22659513699911384.27250

We then plot a density graph to see the price distribution of the most popular neighbourhood.

#Create the dataset of the top neighbourhood and the price
listings_neigh<-listings2 %>% 
  
  #Group by variable we are looking at
  group_by(neighbourhood_cleansed) %>% 
  
  #ANd get the price
  summarize(price=price)

listings_neigh$new <- factor(listings_neigh$neighbourhood_cleansed, levels = c("Top 5", "Top 6-10", "Top 11-15", "Remaining"))

#Plot the density graph
ggplot(listings_neigh,aes(x=log(price), y = desc(new), fill = new))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Difference in price by region expensiveness",
      subtitle="Density plot for prices in different region groups",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #removing y-axis 
    theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  
  #Title of Legend
  guides(fill=guide_legend(title="Neighbourhood regions"))+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

The graph shows that the neighbourhoods vary in price and will likely be a good predictor for our analysis later on.

Let’s look at the median price by neighbourhood as well:

#Calculate the median price of different neighborhood
median_per_neighborhood <- listings2 %>% 
  
  #Grouping by variable
  group_by(neighbourhood_cleansed) %>% 
  
  #We want median price
  summarise(median_price = median(price))

#Plot the graph
ggplot(median_per_neighborhood,
       
       #Price versus neighbourhood
       aes(x = reorder(neighbourhood_cleansed, median_price),
           y= median_price)) +
  
  #Columns chart
  geom_col(fill = "skyblue") +
  
  #USeful labes
  labs(
    title = "Median price per night per Neighborhood",
    x="Neighborhood",
    y="Median Price"
    ) +
  
  #To flip axis
  coord_flip() +
  
  #Simple theme
  theme_bw() +
  NULL

Plotting on a median price basis indicates that these prices are still indeed impacted by how close they are to center.

2.3.3 Number of Bedrooms

We skimmed our data to summarise the price distribution based on number of bedrooms.

From the table below, we noticed that listings with more than five bedrooms have less than 10 samples, therefore, we decided to only include properties with up to five bedrooms. Furthermore, we found out that median price will better suit the purpose of our research than the mean price because the distribution of price based on number of bedrooms is greatly skewed due to extreme outliers. Therefore, we believe that the median price gives better understanding of the relationship between the price and the number of bedrooms

#Analyse the distribution of price of listings depends on number of bedrooms
favstats(price~bedrooms, data=listings2)

bedroomsminQ1medianQ3maxmeansdnmissing
14861121498e+03       124113  74200
2421301722252e+03       18994.928280
3291501992701.92e+03232143  8530
4251942503501.16e+03293165  2600
5125247330478814       383191  400
650409443493550       389197  50
7487487487487487       487  10
850162530561811       422308  60
9557557557557557       557  10
10443696725797857       704159  50
1283838383857       212316  60
15120120120120120       120  10
The graph below shows that the listings having more bedrooms have a higher median price.

We would like to plot this information

#Filter the bedrooms under 5 to remove outliers
listings2 <- listings2 %>% 
  filter(bedrooms<=5)

#Median calculations
median_per_bedroom <- listings2 %>% 
  
  #By Bedroom
  group_by(bedrooms) %>% 
  
  #Summarized by median price
  summarise(median_price = median(price))

#Plotting the graph
ggplot(median_per_bedroom,
       
       #Bedroom by price
       aes(x = reorder(bedrooms,median_price),
           
           #Price on Y Axis
           y= median_price)) +
  
  #in Column format
  geom_col(fill = "orchid3") +
  
  #With useful labelss
  labs(
    title = "Median price per night per bedrooms",
    x="Number of Bedrooms",
    y="Median Price"
    ) +
  
  #To flip axis
  coord_flip() +
  
  #Simple theme
  theme_bw() +
  NULL

We see a strong relationship with number of bedrooms and price.

We assume number of bedrooms will have high correlation with accommodation and number of beds and will therefore leave those two variables out for now.

2.3.4 Ratings

We skimmed our data to summarise the price distribution based on the rating score.

#Analyse the distribution of price of listings depends on the rating score
favstats(price~review_scores_rating,data=listings2)
review_scores_ratingminQ1medianQ3maxmeansdnmissing
14761.299.5163  400       145  120       140
26079.892.5116  225       112  54.6     120
2.33121121  121  121  121       121         10
2.56585  105  122  140       103  37.5     30
2.75130130  130  130  130       130         10
2.8149149  149  149  149       149         10
340100  130  166  579       146  89.4     630
3.256567.570  75  80       71.77.64    30
3.29120120  120  120  120       120         10
3.3350120  130  150  200       130  54.3     50
3.4133133  133  133  133       133         10
3.530100  117  145  345       138  80.8     230
3.5691112  133  154  175       133  59.4     20
3.578484  84  84  84       84         10
3.67085  100  138  176       115  54.6     30
3.63125134  144  154  163       144  26.9     20
3.672983.8115  165  286       133  69.6     160
3.69129129  129  129  129       129         10
3.71124148  172  196  220       172  67.9     20
3.7595135  150  219  373       180  87.9     90
3.789090  90  90  90       90         10
3.896126  152  187  262       163  59.2     60
3.8383103  114  122  130       110  20.1     40
3.866573.882.591.2100       82.524.7     20
3.8860102  144  322  500       235  234       30
3.8980122  165  208  250       165  120       20
42590  120  150  999       137  88.2     2910
4.036066.272.578.885       72.517.7     20
4.04138138  138  138  138       138         10
4.054859  70  120  171       96.365.6     30
4.0777100  145  150  155       125  34.8     50
4.0896100  101  125  760       236  293       50
4.095882  110  132  140       104  37.8     40
4.149174  298  349  400       249  181       30
4.116087  100  160  230       127  56.6     90
4.137880  121  199  461       164  111       130
4.146585  130  140  760       164  182       130
4.155483.595.5120  180       105  43.5     60
4.16140140  140  140  140       140         10
4.1785100  140  160  599       162  119       170
4.1837108  132  167  200       132  52.9     80
4.1975112  150  225  300       175  115       30
4.24979.2104  137  400       119  70.5     300
4.2150100  150  165  180       127  68.1     30
4.2263111  146  198  400       161  87.8     140
4.234985.5100  131  453       135  111       110
4.243463.884.5141  214       102  60.3     80
4.253581  103  144  599       131  98       580
4.266179.598  116  135       98  52.3     20
4.275285.2100  121  297       114  64.4     120
4.286069.879.589.299       79.527.6     20
4.294688.5118  146  229       123  46.6     180
4.33984.297.5135  200       109  46.4     160
4.315666  89  100  319       109  73.2     130
4.323347.255  112  130       74.538.7     80
4.334081.5118  152  1.15e+03139  120       1240
4.344555  60  68  139       73.437.6     50
4.354585  147  195  275       148  76.4     130
4.367099.8117  150  202       123  35.9     160
4.376685  99  100  125       94.319.2     70
4.3835108  130  171  252       137  50.8     280
4.395973.5130  176  8e+03       840  2.38e+03110
4.42279.8124  168  740       143  108       520
4.413385.5125  160  270       133  66.2     150
4.424289.8125  148  250       127  53.7     220
4.4334100  122  161  430       135  65.5     480
4.445894.2115  149  283       128  54.7     240
4.4538105  143  206  410       166  93.3     250
4.464686  130  186  250       141  65.8     220
4.474488.8102  158  500       148  111       240
4.4865105  130  158  329       154  81.6     180
4.495880  90  118  188       99.836.1     110
4.52990  120  160  2e+03       141  120       4300
4.514391.5122  162  264       135  77.3     60
4.526880  103  140  707       151  155       170
4.534582  105  150  240       118  49.1     410
4.543366  96  145  286       106  52.6     330
4.552880  113  163  389       131  74.1     510
4.564798.8124  152  250       131  49.1     640
4.572581.5108  150  740       141  111       940
4.586598.8130  162  895       162  134       480
4.595081.8130  193  464       146  88       340
4.64095  125  164  599       140  81.9     1360
4.615889.5110  172  695       149  115       350
4.622675  107  150  450       122  74.7     420
4.632991.5120  164  550       143  85.5     1150
4.642689.8130  186  1.07e+03166  149       680
4.653291.2133  188  367       147  82.5     420
4.663063.5109  180  399       131  90.3     230
4.671790  120  164  599       138  75.7     4350
4.684098.5125  163  800       154  110       630
4.693090  115  168  450       140  77.7     880
4.73599  130  172  350       140  65.3     950
4.714299  120  153  400       135  59.1     1650
4.723384.5117  144  300       128  65.1     560
4.733085  120  170  800       144  101       1240
4.743083  120  166  500       141  86.4     710
4.753795  129  180  829       148  87.4     3600
4.7634122  145  180  500       158  78.7     730
4.775095  125  162  729       146  92.8     990
4.782596  130  190  1.16e+03155  112       1530
4.792796  146  200  518       163  89.4     1370
4.825100  120  173  760       145  78.2     3080
4.812892  124  164  463       146  87.9     1410
4.822098.2135  193  699       156  92.9     1580
4.8329100  135  186  800       157  89.8     2900
4.842590.8129  166  534       145  86.3     1040
4.85495  124  194  550       149  86.1     1510
4.8645100  139  198  950       161  96       2680
4.872690  124  179  440       139  68       1450
4.8847100  125  194  1.19e+03158  101       2980
4.893896.5139  180  550       156  90.6     2510
4.931100  131  186  500       154  77.4     2160
4.9127100  140  190  500       157  79.7     1970
4.9230100  135  185  900       160  97.5     2420
4.9350109  150  200  2e+03       172  146       2200
4.9435106  144  200  745       171  104       1760
4.9550102  142  200  504       164  82.6     1400
4.9628100  136  196  399       157  80.2     1220
4.974399  139  199  350       154  67.8     930
4.984595  136  198  950       183  160       440
4.9980110  139  156  279       146  61.6     90
510101  140  190  1.92e+03163  104       34040

By drawing a regression line between the median price and the ratings, we concluded that they share a positive correlation, which means higher ratings have higher median price, and vice versa.

We will use two different methods for this, first we will use the median for each rating creating a smaller subset of datapoints:

##Create the data of price and score rating
median_per_rating <- listings2 %>% 
  
  #Group by review
  group_by(review_scores_rating) %>% 
  
  #Summarize by median price
  summarise(median_price = median(price)) %>% 
  
  #Exclude NA's 
  na.omit()


#Plotting
ggplot(median_per_rating,
       
       #Review scores vs median price
       aes(x = review_scores_rating,
           y= median_price)) +
  
  #Coloring red dots
  geom_point(color="red") +
  
  #Adding useful titles
  labs(title = "Median price per night vs. Ratings",
    x="Ratings",
    y="Median Price"
    ) +
  
  #Adding a trend line
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simple theme
  theme_bw() +
  
  ylim(0, 200) + 
  NULL

We see some relationship between price and ratings. Let’s explore this further and see if this depends on the room type.

#Plotting by rating but log of price
ggplot(listings2, aes(x = review_scores_rating, y = log(price), color = room_type)) + 
  
  #In red color 
  geom_point(alpha = 0.5) + 
  
  #And adding a trendline
  geom_smooth(se = F, method = "lm", color = "black") + 
  
  #In minimal Theme
  theme_bw() + 
  
    #Adding useful titles
  labs(title = "Log price per night vs. Ratings",
    x="Ratings",
    y="Log Price"
    ) +
  
  facet_wrap(~ room_type, scales = "free") + 
  
  guides(color=F) + 
  
  NULL

Interestingly, especially hotel rooms are sensitive to review ratings.

2.3.5 Superhosts

Let’s first look at a key summary

#Analyse the distribution of price of listings depends on the rating score
favstats(price~host_is_superhost,data=listings2)
host_is_superhostminQ1medianQ3maxmeansdnmissing
FALSE4991301808e+03154126  97670
TRUE278812718090015092.616340

Interestingly, non-superhosts seem to have a higher price than superhosts. Let’s plot some relationships to explore this further.

#Create the data of price, score rating, and superhost
median_per_rating1 <- listings2 %>% 
  
  #grouping by review score
  group_by(review_scores_rating) %>% 
  
  #Summarizing by price and superhost
  summarise(median_price = median(price),
            host_is_superhost=host_is_superhost) %>% 
  
  #Excluding NA's
  na.omit()

#Plotting the dataset 
ggplot(median_per_rating1,
       
       #By same as above but including host is superhost
       aes(x = review_scores_rating,
           y= median_price, 
           colour = host_is_superhost)) +
  
  #Adding scatterplot
  geom_point() +
  
  #Adding useful labels
  labs(title = "Median price per night vs. Ratings",
    x="Ratings",
    y="Median Price"
    ) +
  
  #Adding Trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Adding a simple theme
  theme_bw() +
  
  #Adding a legend
  guides(color=guide_legend(title="Host is Superhost"))+
  
  #Wrapping by host type to get two different charts
  facet_wrap(~host_is_superhost) +
  
  NULL

Although both mean and median price for listings by non-superhost were higher than that by superhost, the regression analysis above shows that the correlation between ratings and median prices is higher for superhost. Therefore, the median price of listings by superhost is more affected by the change in ratings than that by non-superhost.

Now let’s also take a look at the effect from superhosts themselves:

#Plotting by host type
ggplot(listings2,
       
       #Adding key variables
       aes(x=host_is_superhost,
           y=log(price), 
           fill = host_is_superhost))+
  
  #Looking at a boxplot for distribution
  geom_boxplot()+
  
  #Adding useful labels
  labs(title = "Superhost versus non superhosts price distributions",
    x="Superhost",
    y=" Log Price"
    )+
  
  #Removing legend
  guides(fill = F) + 
  
  #Simple Theme
  theme_bw() + 
  
  NULL

With more extreme outliers, the box plot of non-superhosts shows higher standard deviation compared to that of superhost.Let’s confirm this:

#Plot the density graph
ggplot(listings2,
       aes(x=log(price), 
           y = host_is_superhost, 
           fill = host_is_superhost))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Prices by host type",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #Title of Legend
  guides(fill=F)+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

Based on the graph above, the distribution of superhost seems to have a greater variation. However, the distribution of superhost has a lower standard deviation than that of non-superhost. Therefore, we believe that the graph above is greatly influenced by the number of samples included, i.e. 1,634 superhost vs 9,767 non-superhost.

2.3.6 Number of reviews

Similarly we can graph a relationship between number of reviews and price.

As this is a numerical variable with many ranges we will not use favstats to look into this.

ggplot(listings2, 
       
       #Plotting reviews by price
       aes(x = number_of_reviews,
           y= log(price))) +
  
  #Adding color to the chart
  geom_point(color="red", alpha = 0.3) +
  
  #Adding useful labels
  labs(title = "Price per night vs. number of Reviews",
    x="Number of Reviews",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Adding BW Theme
  theme_bw() +
  
  NULL

We see that as the number of reviews increases the price falls.

Let’s check if this holds across categorical variables

ggplot(listings2, 
       
       #Plotting reviews by price
       aes(x = number_of_reviews,
           y= log(price),
           color = room_type)) +
  
  #Adding color to the chart
  geom_point(alpha = 0.5) +
  
  #Adding useful labels
  labs(title = "Price per night vs. number of Reviews",
    x="Number of Reviews",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Adding BW Theme
  theme_bw() +
  
  #Wrapping by host type to get two different charts
  facet_wrap(~room_type, scales ="free") + 
  
  guides(color = F) + 
  
  NULL

This is super interesting to see because we see that the relationship depends on room types. Therefore it is important to control for room type when looking at the effect of number of reviews!

2.3.7 License

Let’s first look at a summary of the data:

#Analyse the distribution of price of listings depends on the rating score
favstats(price~license,data=listings2)

licenseminQ1medianQ3maxmeansdnmissing
no4991291758e+03       15113080310
yes25961361951.19e+0315910133700
We see that the listings with license have higher mean and median prices.

By graphing a boxplot, we see that properties with license have higher Q1, median and Q3 value than properties without license. Hence, we conclude that properties with licence are more expensive than the others.

#Plotting the data
ggplot(listings2,
       aes(x=license,
           y=log(price)))+
  
  #In Wheat Color
  geom_boxplot(fill="wheat2")+
  
  #Useful labels
  labs(title = "Price per night vs. License",
    x="License",
    y=" Price"
    )+
  
  #BW Theme
  theme_bw()+ 
  
  NULL

We see that there is a higher spread on prices of properties without licenses and a slightly lower median price.

2.3.8 Room type

#Analyse the distribution of price of listings depends on the rating score
favstats(price~room_type,data=listings2)
room_typeminQ1medianQ3maxmeansdnmissing
Entire home/apt41101471998e+03169  128  87310
Hotel room2683120156900138  135  800
Private room1765851152e+0399.574  25570
Shared room2037658233686  71.5330

There seems to be an extremely strong relationship between room type and price. This is good for our analysis but let’s first plot it further.

First let’s visualize the median price:

#Calculate the median price of different room types
median_per_room <- listings2 %>% 
  
  #By Room
  group_by(room_type) %>% 
  
  #In median Prices
  summarise(median_price = median(price))

#Plot the graph
ggplot(median_per_room,
       
       #Reorder by price
       aes(x = reorder(room_type, median_price),
           
           #Price on Y Axis
           y= median_price)) +
  
  #Column Chart
  geom_col(fill = "skyblue") +
  
  #Useful Labels
  labs(
    title = "Median price by room_type",
    x="Room Type",
    y="Median Price"
    ) +
  
  #Flipping for aesthetics
  coord_flip() +
  
  #Simple theme for aesthetics
  theme_bw() +
  NULL

We see that entire homes and hotel rooms are especially expensive. We would like to plot the density of this to explore it further:

#Density Plot
ggplot(listings2,
       aes(x=log(price), 
           y = room_type, 
           fill = room_type))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Prices by room type",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #Title of Legend
  guides(fill=F)+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

Interestingly although entire homes or apartments are the most expensive in terms of median, we see that some hotels are also extremely expensive.

We also see that these distributions are multimodal - i.e. prices are clustered around specific levels.

#Using above dataset for the confidence interval calculations
formula_ci <- listings2%>% 
  group_by(room_type) %>% 
  
  #Calculate weight's summary statistics for people exercising at least 3 times a week 
  
  # calculate mean, SD, count, SE, lower/upper 95% CI
  summarise(
    average_price=mean(price,na.rm=TRUE), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
            
    sd_price=sd(price,na.rm=TRUE), #Standard Deviation
            
    count= n(), #Observations
           
    t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
            
    se_price=sd_price/sqrt(count), #Standard Error 
           
    margin_of_error= t_critical*se_price, #Margin of Error
            
    price_low= average_price - margin_of_error, #Lower interval
            
    price_high= average_price + margin_of_error) #Upper Interval 

formula_ci
room_typeaverage_pricesd_pricecountt_criticalse_pricemargin_of_errorprice_lowprice_high
Entire home/apt169  128  87311.961.372.69167  172
Hotel room138  135  801.9915.1 30   108  168
Private room99.574  25571.961.462.8796.6102
Shared room86  71.5332.0412.4 25.4 60.6111
ggplot(formula_ci, aes(x=average_price, y=room_type, color=room_type)) +

#geom_errorbar function allows us to show the two bars with confidence intervals

geom_errorbar(aes(xmin=price_low, xmax=price_high),width = 0.1, size=0.5)+ 
  
geom_point(aes(x=average_price),size=1)+

  
theme_bw() +
  
theme(legend.position = "none",axis.title.y=element_blank())+
  
  labs(title = "Confidence interval for average price per room type",
       subtitle = "95% confidence intervals overlap",
       x = "Average Price"
       ) +
  NULL

The confidence intervals’ lengths are very different between each room type. Shared and hotel rooms have wide widths. There is a big uncertainty related to their average price. On the other side, we can estimate quite accurately the entire and private homes’ average.

2.3.9 Property Type

Let us look into the statistics first:

#Analyse the distribution of price of listings depends on the rating score
favstats(price~room_type,data=listings2)
room_typeminQ1medianQ3maxmeansdnmissing
Entire home/apt41101471998e+03169  128  87310
Hotel room2683120156900138  135  800
Private room1765851152e+0399.574  25570
Shared room2037658233686  71.5330

There seems to be a very strong relationship between property type and price. Lets plot the medians again:

#Calculate the median price of different room types
median_per_prop <- listings2 %>% 
  group_by(prop_type_simplified) %>% 
  summarise(median_price = median(price))

#Plot the graph
ggplot(median_per_prop,
       aes(x = reorder(prop_type_simplified, median_price),
           y= median_price)) +
  
  #Column type
  geom_col(fill = "skyblue") +
  
  #Useful labels
  labs(
    title = "Median price by property type",
    x="Room Type",
    y="Median Price"
    ) +
  
  #Flipping for aesthetics
  coord_flip() +
  
  #Simple theme
  theme_bw() +
  NULL

This is very useful but let’s also take a look at the distribution:

#Density Plot
ggplot(listings2,
       aes(x=log(price), 
           y = prop_type_simplified, 
           fill = prop_type_simplified))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Prices by property type",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #Title of Legend
  guides(fill=F)+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

There is a very distinct pattern between property types and this will likely be a useful regressor for our analysis.

2.3.10 Avaialability next 30 days

Let’s also look at availability next 30 days which we hypothesize will be an indicator of popularity.

#Summarizing median price by availability
listings_availability<-listings2 %>% 
  group_by(availability_30) %>% 
  summarize(price=median(price))

#Plotting scatterplot of availability 
ggplot(listings_availability, 
       aes(x = availability_30,
           y= log(price))) +
  
  #Points in red
  geom_point(color="red") +
  
  #Useful Labels
  labs(title = "Log Price Per Night versus Availability",
    x="Availability next 30 days",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  NULL

The more availability, the higher the price, which indicates that more premium rooms are booked less often or are less often fully booked.

This goes against our initial thought that less availability would be popular rooms and therefore more expensive! Good that we looked into this.

2.3.11 Instant Bookable

Let’s also look at whether or not units are instantly bookable

#Analyse the distribution of price of listings depends on the rating score
favstats(price~instant_bookable,data=listings2)
instant_bookableminQ1medianQ3maxmeansdnmissing
FALSE41001301802e+0315494.788690
TRUE25891251758e+03150188  25320

Instant bookable seems to be slightly higher distributed. Let’s take a look at the graph:

#Boxplotting instant bookable vs log prices
ggplot(listings2,aes(x=instant_bookable,y=log(price)))+
  
  #Boxplot format
  geom_boxplot(fill="wheat2")+
  
  #Useful labels
  labs(title = "Distribution of prices depending on whether or not units are instantly bookable",
    x="Instantly Bookable",
    y="Log Price"
    )+
  
  #Simople Theme
  theme_bw()

There seems to be little to no difference between whether or not the units are bookable, however, we will still look further into this in the regression.

2.3.12 Host Since

Let’s also look at when they have been host since:

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = host_since,
           y= log(price))) +
  
  #Points in red
  geom_point(color="red", alpha = 0.2) +
  
  #Useful Labels
  labs(title = "Log price per night versus how long the host has been active on Airbnb",
    x="Host since (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  NULL

There seems to be little to no relationship here and we will likely not need to consider it going forward.

Let’s check across categoris

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = host_since,
           y= log(price), 
           color = prop_type_simplified)) +
  
  #Points in red
  geom_point(alpha = 0.5) +
  
  #Useful Labels
  labs(title = "Log price per night versus how long the host has been active on Airbnb",
    x="Host since (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  
  facet_wrap(~prop_type_simplified, scales = "free") + 
  
  guides(color = F) + 
  
  NULL

This seems to hold true across a variety of categories and we will not consider it going forward.

2.3.13 Last Review

Let’s also look at whether last review date has an impact

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = last_review,
           y= log(price))) +
  
  #Points in red
  geom_point(color="red", alpha = 0.2) +
  
  #Useful Labels
  labs(title = "Log price per night versus when the last review was",
    x="Last Review (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  NULL

We see a slight relationship between price and the date of last review. We also noticed that some properties do not have a recent review probably due to covid-19 outbreak.

We will plot the density for reference

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = last_review)) + 
  
  geom_density(fill = "blue", alpha = 0.2) + 
  
  #Useful Labels
  labs(title = "Distributions of last reviews",
    x="Last Review (date)",
    ) +

  #Simplified theme
  theme_bw() +
  NULL

Although many properties did not receive a review during the Covid-19 outbreak, we will still include these data in our analysis.

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = last_review,
           y= log(price), 
           color = prop_type_simplified)) +
  
  #Points in red
  geom_point(alpha = 0.5) +
  
  #Useful Labels
  labs(title = "Log price per night versus when the last review was",
    x="Last Review (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  facet_wrap(~prop_type_simplified, scales = "free", ncol = 3) + 
  
  #Simplified theme
  theme_bw() +
  
  guides(color =F) + 
  NULL

We see that this relationship holds across a variety of properties.

2.3.14 Paired relationship overview

We plotted a GG pairs plot to get a comprehensive overview of the relationship between the price of listings and the selected variables.

We only plot the numerical variables where we can get a correlation and scatterplot of the relationship

#Relatonship between the price and variables, starting with numerical variables
listings2 %>% 
  
  #Choosing our variables
  select(price, #Variable we are trying to explain
         accommodates, 
         review_scores_rating, 
         number_of_reviews, 
         minimum_nights, 
         reviews_per_month, 
         bedrooms, 
         beds, 
         availability_30,
         last_review) %>% 
  
  #Size and aesthetics
  ggpairs(aes(alpha=0.2))+
  theme_minimal(base_size=8)

From the above we see that for price the most correlated variables is the number of accomodates or bedrooms (although these are correlated with 0.73 and will likely not both be included). However, from here it seems that perhaps it is better to focus on accommodates rather than bedrooms.

It is worth noting that some of the less correlated variables seem to be review scores, number of reviews, however, we will still try and test these in our regression.

Unfortunately, GGPairs does not work well with categorical variables and we will use that data we extracted above as an indicator for whether or not these variables will be useful going forward. `

3 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights. We first create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

3.1 Predicted Variable

We assume that this leaves us with Airbnbs accommodating 2 or more people. I.e. we filter out Airbnb’s which accomodate 1 person.

Also, we assume that no other people would go to the airbnb lowering the cost per person (i.e. 2 people could rent a 10 people Airbnb but would be the only two to pay).

Lets create the cost for two people and filter for the apartments that have 2 people min:

##Fist, we have to filter the data to represent 2 people staying for 4 nights.
listings_4 <- listings2 %>% 
  filter(accommodates>=2) %>% 
  mutate(price_4_nights =price*4)

We take logarithm of ’price_4_nights‘ and create the density plot, because it looks more like the normal distribution. See below:

# Plot density of price using price_4_nights

#Using price 4 nights without log
ggplot(listings_4,aes(x=price_4_nights))+
  
  #In a density plot
  geom_density() +   
  
  #Useful labels
  labs(      
      title = "Density plot for prices for 4 nights",
      x = "price per night (log)",
      y = "Density") + 
  
  #Simple Theme
      theme_classic() + 
  
    NULL

# Plot density of price using log(price_4_nights)
ggplot(listings_4,aes(x=log(price_4_nights)))+
  
  #In density Plot
  geom_density() +   
  
  #usfeul Labs
  labs(      
      title = "Density plot for prices for 4 nights",
      x = "price per night (log)",
      y = "Density") + 
  
  #Sumple Theme
      theme_classic() + 
  
    NULL

We see that taking the log of prices brings us much closer to a normal distribution why we use log prices going forward.

3.2 Models

3.2.1 Model 1: property type, number of reviews and the rating scores

We first fit a regression model with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

First, let us split the data into a training and testing set. The training will be used for the models and the testing later on:

#Set seed for regeneration purposes
set.seed(123456)


#Splitting 75% into a training set
train_test_split <- initial_split(listings_4, prop = 0.75)

#New variable name for training set
listings_train <- training(train_test_split)

#New Variable name for test set 
listings_test <- testing(train_test_split)

Lets now create the model:

#Create the model1
model1 <- lm(log(price_4_nights) ~ #Predicted Variable
               
               #Explanatory Variables
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating,
             
             #Dataset
             data=listings_train)

#Summary of Model 1
summary(model1)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5932 -0.2978 -0.0354  0.2593  4.0634 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.836e+00  7.097e-02  82.232
prop_type_simplifiedEntire residential home      2.252e-01  2.056e-02  10.958
prop_type_simplifiedEntire townhouse             2.923e-01  3.130e-02   9.339
prop_type_simplifiedOther                       -6.992e-02  1.362e-02  -5.135
prop_type_simplifiedPrivate room in rental unit -5.324e-01  1.654e-02 -32.188
number_of_reviews                               -5.425e-04  8.414e-05  -6.447
review_scores_rating                             1.107e-01  1.480e-02   7.483
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      < 2e-16 ***
prop_type_simplifiedEntire townhouse             < 2e-16 ***
prop_type_simplifiedOther                       2.88e-07 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                               1.20e-10 ***
review_scores_rating                            8.03e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 8356 degrees of freedom
Multiple R-squared:  0.1707,    Adjusted R-squared:  0.1701 
F-statistic: 286.6 on 6 and 8356 DF,  p-value: < 2.2e-16

We come to an adjusted R squared of 0.1707 which is a good start. We see that all variables have significant t-values indicating they are significantly different from 0.

More specifically these values changes as the category changes from entire rental unit (the variable excluded)

  • Entire residential home increases log of price 0.2
  • Entire Townhouse increases log of price 0.28
  • Other decreases log of price 0.007
  • Private room decreases log of price 0.53

These numbers are compared to if the unit is entire rental Unit which is the variable excluded (Categorical Variables exclude one variable)

For review scores rating we see that as rating goes up 1 unit log of price goes up 0.019.

However, as number of reviews goes up price goes down. For every 100 additional reviews price falls roughly 0.05 (Likely because those properties are less premium).

We recall from earlier that reviews should account for room type which is included now:

3.2.2 Model 2: Room Type

Now lets add the room type

#Create the model2
model2 <- lm(log(price_4_nights) ~ #Predicted Variable
               
               #Explanatory Variables
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type,
             
             #Dataset
             data=listings_train)

#Summary of Model 2
summary(model2)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5898 -0.2870 -0.0362  0.2469  4.0502 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.970e+00  6.894e-02  86.594
prop_type_simplifiedEntire residential home      2.269e-01  1.990e-02  11.405
prop_type_simplifiedEntire townhouse             2.953e-01  3.030e-02   9.745
prop_type_simplifiedOther                        2.389e-01  1.871e-02  12.768
prop_type_simplifiedPrivate room in rental unit  1.185e-02  2.810e-02   0.422
number_of_reviews                               -1.949e-04  8.281e-05  -2.354
review_scores_rating                             8.157e-02  1.438e-02   5.673
room_typeHotel room                             -2.579e-01  6.379e-02  -4.043
room_typePrivate room                           -5.622e-01  2.385e-02 -23.569
room_typeShared room                            -6.104e-01  1.064e-01  -5.737
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      < 2e-16 ***
prop_type_simplifiedEntire townhouse             < 2e-16 ***
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit   0.6732    
number_of_reviews                                 0.0186 *  
review_scores_rating                            1.45e-08 ***
room_typeHotel room                             5.33e-05 ***
room_typePrivate room                            < 2e-16 ***
room_typeShared room                            9.96e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4324 on 8353 degrees of freedom
Multiple R-squared:  0.2231,    Adjusted R-squared:  0.2223 
F-statistic: 266.6 on 9 and 8353 DF,  p-value: < 2.2e-16

We see that hotel rooms, private rooms and shared rooms all lower the price compared to entire home/apt.

  • Hotels lower log price -0.25
  • Private rooms lower log price -0.56
  • Shared rooms lower log price -0.61

Our model has higher Adjusted Rsquared but some variables are no longer significant.

Lets now test for VIF given our model has some p-value issues:

car::vif(model2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.162167  4        1.195129
number_of_reviews    1.141835  1        1.068567
review_scores_rating 1.020372  1        1.010135
room_type            4.292148  3        1.274811

Let’s keep the variables in given they are below 5 but there is some overlap between property type and room type.

We may have to account for this later on.

3.2.3 Model 3: Bedrooms, beds and accomodates

We want to further determine whether the number ofbedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights. Lets add it to our model.

We keep in mind that these might be correlated and will test VIF after.

#New model
model3_1 <- lm(
  
  #Predicted Variable
  log(price_4_nights) ~ 
                 
    
                #Explanatory Variables
                 prop_type_simplified+
                 number_of_reviews+
                 review_scores_rating +
                 room_type + 
                 beds +  
                 bedrooms + 
                 accommodates, 
               
            #dataset
               data = listings_train)

summary(model3_1)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type + beds + bedrooms + accommodates, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4339 -0.2347 -0.0146  0.2135  4.0939 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.308e+00  6.297e-02  84.295
prop_type_simplifiedEntire residential home     -7.360e-02  1.903e-02  -3.868
prop_type_simplifiedEntire townhouse             5.799e-03  2.790e-02   0.208
prop_type_simplifiedOther                        1.478e-01  1.669e-02   8.855
prop_type_simplifiedPrivate room in rental unit -3.664e-02  2.490e-02  -1.472
number_of_reviews                               -2.342e-04  7.353e-05  -3.186
review_scores_rating                             1.106e-01  1.278e-02   8.652
room_typeHotel room                             -7.229e-02  5.660e-02  -1.277
room_typePrivate room                           -3.881e-01  2.144e-02 -18.106
room_typeShared room                            -4.355e-01  9.433e-02  -4.616
beds                                             9.345e-03  5.935e-03   1.574
bedrooms                                         1.234e-01  9.302e-03  13.272
accommodates                                     1.165e-01  6.016e-03  19.368
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      0.00011 ***
prop_type_simplifiedEntire townhouse             0.83537    
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit  0.14117    
number_of_reviews                                0.00145 ** 
review_scores_rating                             < 2e-16 ***
room_typeHotel room                              0.20157    
room_typePrivate room                            < 2e-16 ***
room_typeShared room                            3.97e-06 ***
beds                                             0.11541    
bedrooms                                         < 2e-16 ***
accommodates                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3828 on 8350 degrees of freedom
Multiple R-squared:  0.3914,    Adjusted R-squared:  0.3906 
F-statistic: 447.6 on 12 and 8350 DF,  p-value: < 2.2e-16

We notice that beds are not significant and will take that out.

#New model
model3_2 <- lm(
  
  #Predicted Variable
  log(price_4_nights) ~ 
                 
    
                #Explanatory Variables
                 prop_type_simplified+
                 number_of_reviews+
                 review_scores_rating +
                 room_type + 
                 bedrooms + 
                 accommodates, 
               
            #dataset
               data = listings_train)

summary(model3_2)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type + bedrooms + accommodates, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4334 -0.2340 -0.0144  0.2132  4.0992 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.302e+00  6.285e-02  84.349
prop_type_simplifiedEntire residential home     -7.151e-02  1.898e-02  -3.767
prop_type_simplifiedEntire townhouse             8.377e-03  2.786e-02   0.301
prop_type_simplifiedOther                        1.489e-01  1.667e-02   8.929
prop_type_simplifiedPrivate room in rental unit -3.620e-02  2.490e-02  -1.454
number_of_reviews                               -2.283e-04  7.344e-05  -3.108
review_scores_rating                             1.105e-01  1.278e-02   8.648
room_typeHotel room                             -7.243e-02  5.661e-02  -1.279
room_typePrivate room                           -3.874e-01  2.143e-02 -18.074
room_typeShared room                            -4.307e-01  9.429e-02  -4.568
bedrooms                                         1.291e-01  8.582e-03  15.042
accommodates                                     1.212e-01  5.245e-03  23.100
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home     0.000166 ***
prop_type_simplifiedEntire townhouse            0.763655    
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 0.145961    
number_of_reviews                               0.001889 ** 
review_scores_rating                             < 2e-16 ***
room_typeHotel room                             0.200763    
room_typePrivate room                            < 2e-16 ***
room_typeShared room                               5e-06 ***
bedrooms                                         < 2e-16 ***
accommodates                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3828 on 8351 degrees of freedom
Multiple R-squared:  0.3912,    Adjusted R-squared:  0.3904 
F-statistic: 487.9 on 11 and 8351 DF,  p-value: < 2.2e-16

We also notice that some variables are no longer significant.

Let’s check VIF

vif(model3_2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 5.152846  4        1.227456
number_of_reviews    1.145956  1        1.070493
review_scores_rating 1.028391  1        1.014096
room_type            4.429142  3        1.281504
bedrooms             2.473767  1        1.572821
accommodates         2.182146  1        1.477209

We now see quite a high variance inflation factor for Prop_type_simplified, and Room Type. This makes sense as the room type (i.e. hotel room) is related to property type (i.e. a hotel)

Let’s try and run and see each model without one another.

#Same model but without Property Type
model3_3 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews+
    review_scores_rating + 
    room_type+
    bedrooms + 
    accommodates, 
  
  data = listings_train)

#Same model but without Room Type
model3_4 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    prop_type_simplified+
    number_of_reviews+
    review_scores_rating + 
    bedrooms + 
    accommodates, 
  
  data = listings_train)

vif(model3_3)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.139493  1        1.067471
review_scores_rating 1.026732  1        1.013278
room_type            1.229365  3        1.035015
bedrooms             2.207683  1        1.485827
accommodates         2.163928  1        1.471029
vif(model3_4)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.430238  4        1.045746
number_of_reviews    1.109832  1        1.053486
review_scores_rating 1.019848  1        1.009875
bedrooms             2.458383  1        1.567923
accommodates         2.173088  1        1.474140

We immidieatly lower our variance inflation factor.

Now let’s try and decide which model is stronger going forward.

summary(model3_3)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4421 -0.2413 -0.0190  0.2125  4.0866 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.2844695  0.0634856  83.239  < 2e-16 ***
number_of_reviews     -0.0001524  0.0000741  -2.057  0.03968 *  
review_scores_rating   0.1155040  0.0129204   8.940  < 2e-16 ***
room_typeHotel room    0.0628398  0.0552016   1.138  0.25500    
room_typePrivate room -0.3499489  0.0116290 -30.093  < 2e-16 ***
room_typeShared room  -0.2964919  0.0941722  -3.148  0.00165 ** 
bedrooms               0.1187920  0.0082029  14.482  < 2e-16 ***
accommodates           0.1266656  0.0052843  23.970  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3873 on 8355 degrees of freedom
Multiple R-squared:  0.3765,    Adjusted R-squared:  0.376 
F-statistic: 720.9 on 7 and 8355 DF,  p-value: < 2.2e-16
summary(model3_4)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + bedrooms + accommodates, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4258 -0.2358 -0.0135  0.2199  4.1128 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.174e+00  6.369e-02  81.235
prop_type_simplifiedEntire residential home     -9.289e-02  1.932e-02  -4.808
prop_type_simplifiedEntire townhouse            -1.365e-02  2.838e-02  -0.481
prop_type_simplifiedOther                       -5.977e-02  1.191e-02  -5.017
prop_type_simplifiedPrivate room in rental unit -4.037e-01  1.468e-02 -27.502
number_of_reviews                               -4.579e-04  7.372e-05  -6.212
review_scores_rating                             1.313e-01  1.298e-02  10.115
bedrooms                                         1.409e-01  8.726e-03  16.141
accommodates                                     1.267e-01  5.338e-03  23.737
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home     1.55e-06 ***
prop_type_simplifiedEntire townhouse               0.631    
prop_type_simplifiedOther                       5.36e-07 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                               5.48e-10 ***
review_scores_rating                             < 2e-16 ***
bedrooms                                         < 2e-16 ***
accommodates                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3905 on 8354 degrees of freedom
Multiple R-squared:  0.3665,    Adjusted R-squared:  0.3659 
F-statistic:   604 on 8 and 8354 DF,  p-value: < 2.2e-16

We see that model 3_2 (incl. room type) provides better explanatory power than 3_3 (incl. property type), and will utilize that going forward.

This model has an adjusted r square of 0.3659 and adds variables beds, bedrooms and accommodates.

  • Each Bedroom adds 0.118 to the log price
  • Each Accommodates increase adds 0.126 to the log price.

3.2.4 Model4: Superhosts

We want to explore whether the superhosts (host_is_superhost) command a pricing premium, after controlling for other variables.

Let us build this model on model 3.2

#Model 3.2 incl. superhost 
model4 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews+
    review_scores_rating + 
    room_type+
    bedrooms + 
    accommodates+ 
    host_is_superhost, 
  
  data = listings_train)

summary(model4)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4313 -0.2353 -0.0155  0.2141  4.0933 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.369e+00  6.382e-02  84.134  < 2e-16 ***
number_of_reviews     -3.479e-04  7.665e-05  -4.538 5.76e-06 ***
review_scores_rating   9.681e-02  1.301e-02   7.441 1.10e-13 ***
room_typeHotel room    1.667e-02  5.514e-02   0.302  0.76237    
room_typePrivate room -3.705e-01  1.178e-02 -31.455  < 2e-16 ***
room_typeShared room  -2.850e-01  9.370e-02  -3.041  0.00236 ** 
bedrooms               1.220e-01  8.168e-03  14.931  < 2e-16 ***
accommodates           1.239e-01  5.266e-03  23.523  < 2e-16 ***
host_is_superhostTRUE  1.239e-01  1.332e-02   9.303  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3854 on 8354 degrees of freedom
Multiple R-squared:  0.3829,    Adjusted R-squared:  0.3823 
F-statistic:   648 on 8 and 8354 DF,  p-value: < 2.2e-16

The variable is significant however, it only has very little explanatory power.

A superhost adds 0.123 to the log price.

Let’s try and compute the VIF once again to check out model

vif(model4)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.232011  1        1.109960
review_scores_rating 1.051819  1        1.025582
room_type            1.282096  3        1.042286
bedrooms             2.211537  1        1.487124
accommodates         2.170987  1        1.473427
host_is_superhost    1.210964  1        1.100438

We see that VIF is below 5 for all variables meaning the variables do not have too high correlation

3.2.5 Model 5, Immidiate bookings

  1. Some hosts allow you to immediately book their listing. Let’s see if affter controlling for other variables, is instant_bookable a significant predictor of price_4_nights
#Model 4.0 incl. immidiate bookings 
model5 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews+
    review_scores_rating + 
    room_type+
    bedrooms + 
    accommodates+ 
    host_is_superhost+ 
    instant_bookable, 
  
  data = listings_train)

summary(model5)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4236 -0.2342 -0.0153  0.2152  4.0506 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.334e+00  6.409e-02  83.227  < 2e-16 ***
number_of_reviews     -3.623e-04  7.659e-05  -4.730 2.28e-06 ***
review_scores_rating   1.028e-01  1.304e-02   7.881 3.65e-15 ***
room_typeHotel room   -1.783e-02  5.547e-02  -0.322  0.74783    
room_typePrivate room -3.811e-01  1.194e-02 -31.919  < 2e-16 ***
room_typeShared room  -2.811e-01  9.356e-02  -3.005  0.00267 ** 
bedrooms               1.245e-01  8.171e-03  15.235  < 2e-16 ***
accommodates           1.220e-01  5.270e-03  23.154  < 2e-16 ***
host_is_superhostTRUE  1.180e-01  1.335e-02   8.840  < 2e-16 ***
instant_bookableTRUE   5.513e-02  1.071e-02   5.149 2.68e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3848 on 8353 degrees of freedom
Multiple R-squared:  0.3849,    Adjusted R-squared:  0.3842 
F-statistic: 580.7 on 9 and 8353 DF,  p-value: < 2.2e-16

Here we see instant bookable is a significant predictor which will can be used in our model.

If a hotel is instantly bookable log price goes up by 0.055

We once again control VIF

vif(model5)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.233658  1        1.110702
review_scores_rating 1.060229  1        1.029674
room_type            1.336159  3        1.049485
bedrooms             2.219462  1        1.489786
accommodates         2.181078  1        1.476847
host_is_superhost    1.219980  1        1.104527
instant_bookable     1.101576  1        1.049560

Still these factors are not correlated enough for us to be worried.

3.2.6 Model 6: Neighboughoods

For this section we have not chosen the neighbourhood_overview variables as it had a lot of NAs. We use the categories we set up earlier which split neighbourhoods into the most costly and least costly area.

We therefore expect it to have high explanatory power.

#Model 5 + neighbourhoods
model6 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed, 
  
  #Dataset
  data = listings_train)

summary(model6)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5398 -0.2242 -0.0157  0.2040  4.0793 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.048e+00  6.347e-02  79.527  < 2e-16 ***
number_of_reviews               -4.719e-04  7.316e-05  -6.450 1.18e-10 ***
review_scores_rating             1.101e-01  1.244e-02   8.845  < 2e-16 ***
room_typeHotel room             -2.245e-03  5.295e-02  -0.042  0.96618    
room_typePrivate room           -3.705e-01  1.151e-02 -32.186  < 2e-16 ***
room_typeShared room            -2.567e-01  8.927e-02  -2.875  0.00405 ** 
bedrooms                         1.329e-01  7.804e-03  17.032  < 2e-16 ***
accommodates                     1.193e-01  5.029e-03  23.728  < 2e-16 ***
host_is_superhostTRUE            1.099e-01  1.274e-02   8.632  < 2e-16 ***
instant_bookableTRUE             5.557e-02  1.022e-02   5.435 5.62e-08 ***
neighbourhood_cleansedTop 11-15  8.619e-02  1.936e-02   4.453 8.60e-06 ***
neighbourhood_cleansedTop 5      3.658e-01  1.739e-02  21.034  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.274e-01  1.757e-02  12.939  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.367 on 8350 degrees of freedom
Multiple R-squared:  0.4406,    Adjusted R-squared:  0.4398 
F-statistic: 548.1 on 12 and 8350 DF,  p-value: < 2.2e-16

We see that neighbourgood categorization is a really impactful driver on our prices as our adjusted r square went to ca. 44%.

We see that compared to the bottom 7 neighbourhoods the prices are impacted as such:

  • Top 5 areas add 0.366 to log prices
  • Top 6-10 areas add 0.227 to log prices
  • Top 11-15 areas add 0.086 to prices

Testing VIF

vif(model6)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.237386  1        1.112378
review_scores_rating   1.060844  1        1.029973
room_type              1.367241  3        1.053515
bedrooms               2.225889  1        1.491941
accommodates           2.182781  1        1.477424
host_is_superhost      1.220813  1        1.104904
instant_bookable       1.103987  1        1.050708
neighbourhood_cleansed 1.043262  3        1.007084

Still VIF looks fine.

3.2.7 Model 7: Availability

Next lets look at availability in the coming 30 days

#Model 6 + 30 day availability
model7 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed+ 
    availability_30, 
  
  #Dataset
  data = listings_train)

summary(model7)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed + availability_30, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5106 -0.2170 -0.0067  0.1992  4.1299 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.029e+00  6.070e-02  82.851  < 2e-16 ***
number_of_reviews               -6.802e-04  7.035e-05  -9.668  < 2e-16 ***
review_scores_rating             1.103e-01  1.190e-02   9.268  < 2e-16 ***
room_typeHotel room             -9.324e-02  5.073e-02  -1.838  0.06612 .  
room_typePrivate room           -4.307e-01  1.122e-02 -38.403  < 2e-16 ***
room_typeShared room            -2.978e-01  8.538e-02  -3.488  0.00049 ***
bedrooms                         1.453e-01  7.476e-03  19.437  < 2e-16 ***
accommodates                     1.061e-01  4.832e-03  21.952  < 2e-16 ***
host_is_superhostTRUE            8.877e-02  1.220e-02   7.274 3.80e-13 ***
instant_bookableTRUE             3.716e-02  9.798e-03   3.793  0.00015 ***
neighbourhood_cleansedTop 11-15  9.902e-02  1.852e-02   5.348 9.15e-08 ***
neighbourhood_cleansedTop 5      3.715e-01  1.663e-02  22.338  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.453e-01  1.682e-02  14.590  < 2e-16 ***
availability_30                  1.385e-02  4.950e-04  27.979  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3509 on 8349 degrees of freedom
Multiple R-squared:  0.4886,    Adjusted R-squared:  0.4878 
F-statistic: 613.5 on 13 and 8349 DF,  p-value: < 2.2e-16

Once again, this variable is a significant explanatory variable. As availability over the next 30 days goes up with 1 log prices increase by 0.014

Let’s test for VIF once again:

vif(model7)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.251401  1        1.118660
review_scores_rating   1.060844  1        1.029973
room_type              1.422204  3        1.060458
bedrooms               2.233719  1        1.494563
accommodates           2.203964  1        1.484575
host_is_superhost      1.225527  1        1.107035
instant_bookable       1.108985  1        1.053084
neighbourhood_cleansed 1.046609  3        1.007621
availability_30        1.141229  1        1.068283

All VIFS are low and therefore we make no changes.

3.2.8 Model 8: License

#Model 7 + License
model8 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed+ 
    availability_30 + 
    license, 
  
  #Dataset
  data = listings_train)

summary(model8)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed + availability_30 + 
    license, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5100 -0.2168 -0.0067  0.1993  4.1306 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.031e+00  6.091e-02  82.598  < 2e-16 ***
number_of_reviews               -6.854e-04  7.128e-05  -9.615  < 2e-16 ***
review_scores_rating             1.098e-01  1.195e-02   9.190  < 2e-16 ***
room_typeHotel room             -9.457e-02  5.082e-02  -1.861 0.062800 .  
room_typePrivate room           -4.310e-01  1.124e-02 -38.361  < 2e-16 ***
room_typeShared room            -2.979e-01  8.538e-02  -3.488 0.000488 ***
bedrooms                         1.450e-01  7.500e-03  19.340  < 2e-16 ***
accommodates                     1.060e-01  4.833e-03  21.943  < 2e-16 ***
host_is_superhostTRUE            8.767e-02  1.244e-02   7.048 1.95e-12 ***
instant_bookableTRUE             3.688e-02  9.817e-03   3.757 0.000173 ***
neighbourhood_cleansedTop 11-15  9.889e-02  1.852e-02   5.340 9.56e-08 ***
neighbourhood_cleansedTop 5      3.713e-01  1.663e-02  22.322  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.452e-01  1.682e-02  14.578  < 2e-16 ***
availability_30                  1.379e-02  5.100e-04  27.047  < 2e-16 ***
licenseyes                       4.356e-03  9.586e-03   0.454 0.649569    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.351 on 8348 degrees of freedom
Multiple R-squared:  0.4886,    Adjusted R-squared:  0.4877 
F-statistic: 569.6 on 14 and 8348 DF,  p-value: < 2.2e-16

License is not a significant predictor of price when controlling for remaining variables. Let’s go to the next variable

3.2.9 Model 9: Last Review

#Model 7 + Last Review
model_final <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed+ 
    availability_30 + 
    last_review, 
  
  #Dataset
  data = listings_train)

summary(model_final)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed + availability_30 + 
    last_review, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4853 -0.2127 -0.0085  0.1972  4.1747 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3.772e+00  1.422e-01  26.527  < 2e-16 ***
number_of_reviews               -7.711e-04  7.058e-05 -10.926  < 2e-16 ***
review_scores_rating             9.725e-02  1.191e-02   8.167 3.61e-16 ***
room_typeHotel room             -9.117e-02  5.045e-02  -1.807 0.070777 .  
room_typePrivate room           -4.327e-01  1.116e-02 -38.785  < 2e-16 ***
room_typeShared room            -2.818e-01  8.492e-02  -3.319 0.000909 ***
bedrooms                         1.435e-01  7.436e-03  19.293  < 2e-16 ***
accommodates                     1.056e-01  4.805e-03  21.968  < 2e-16 ***
host_is_superhostTRUE            6.673e-02  1.234e-02   5.407 6.60e-08 ***
instant_bookableTRUE             2.061e-02  9.889e-03   2.084 0.037160 *  
neighbourhood_cleansedTop 11-15  9.869e-02  1.841e-02   5.360 8.56e-08 ***
neighbourhood_cleansedTop 5      3.695e-01  1.654e-02  22.341  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.450e-01  1.672e-02  14.653  < 2e-16 ***
availability_30                  1.269e-02  5.063e-04  25.056  < 2e-16 ***
last_review                      7.466e-05  7.646e-06   9.764  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.349 on 8348 degrees of freedom
Multiple R-squared:  0.4943,    Adjusted R-squared:  0.4935 
F-statistic: 582.9 on 14 and 8348 DF,  p-value: < 2.2e-16

Last review is a significant predicotr of price and for each day of more recent reviews the prices are 0.00007 higher.

3.3 Diagnostics, collinearity, summary tables

3.3.1 VIF

Let’s check the VIF for our final model

vif(model_final)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.273594  1        1.128536
review_scores_rating   1.074327  1        1.036497
room_type              1.423269  3        1.060591
bedrooms               2.235151  1        1.495042
accommodates           2.204224  1        1.484663
host_is_superhost      1.267912  1        1.126016
instant_bookable       1.142555  1        1.068904
neighbourhood_cleansed 1.047138  3        1.007706
availability_30        1.207846  1        1.099021
last_review            1.278776  1        1.130830

Variance inflation factor is below 5 for all variables which shows no sign of autocorrelation.

3.3.2 Residual values and normal distribution

Let’s also plot the model using the autoplot to check for potential issues:

autoplot(model_final)

For our model we see that resudual errors are evenly distributed and do not change as the variable fitted values change. This is a good sign for our model.

Futhermore we see from the theoretical quantiles that our model is approximately evenly distributed but not 100% evenly distributed.

3.3.3 Summary Huxtable

Let’s summarize our models in a table

huxreg(list(
  "Prop Type, Reviews, Rating" = model1, 
  "+ Room Type" = model2, 
  "+ Beds + Bedrooms + Accom." = model3_1, 
  "- Beds" = model3_2, 
  "- Prop Type" = model3_3, 
  " - Room Type + Prop Type"  = model3_4, 
  "+ Superhost" = model4, 
  "+ immidiate bookings" = model5, 
  "+ Neighbourhoods" = model6, 
  "+ Availability" = model7, 
  "+ license" = model8, 
  "+ last_review" = model_final))
Prop Type, Reviews, Rating+ Room Type+ Beds + Bedrooms + Accom.- Beds- Prop Type - Room Type + Prop Type+ Superhost+ immidiate bookings+ Neighbourhoods+ Availability+ license+ last_review
(Intercept)5.836 ***5.970 ***5.308 ***5.302 ***5.284 ***5.174 ***5.369 ***5.334 ***5.048 ***5.029 ***5.031 ***3.772 ***
(0.071)   (0.069)   (0.063)   (0.063)   (0.063)   (0.064)   (0.064)   (0.064)   (0.063)   (0.061)   (0.061)   (0.142)   
prop_type_simplifiedEntire residential home0.225 ***0.227 ***-0.074 ***-0.072 ***        -0.093 ***                                                
(0.021)   (0.020)   (0.019)   (0.019)           (0.019)                                                   
prop_type_simplifiedEntire townhouse0.292 ***0.295 ***0.006    0.008            -0.014                                                    
(0.031)   (0.030)   (0.028)   (0.028)           (0.028)                                                   
prop_type_simplifiedOther-0.070 ***0.239 ***0.148 ***0.149 ***        -0.060 ***                                                
(0.014)   (0.019)   (0.017)   (0.017)           (0.012)                                                   
prop_type_simplifiedPrivate room in rental unit-0.532 ***0.012    -0.037    -0.036            -0.404 ***                                                
(0.017)   (0.028)   (0.025)   (0.025)           (0.015)                                                   
number_of_reviews-0.001 ***-0.000 *  -0.000 ** -0.000 ** -0.000 *  -0.000 ***-0.000 ***-0.000 ***-0.000 ***-0.001 ***-0.001 ***-0.001 ***
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
review_scores_rating0.111 ***0.082 ***0.111 ***0.111 ***0.116 ***0.131 ***0.097 ***0.103 ***0.110 ***0.110 ***0.110 ***0.097 ***
(0.015)   (0.014)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.012)   (0.012)   (0.012)   (0.012)   
room_typeHotel room        -0.258 ***-0.072    -0.072    0.063            0.017    -0.018    -0.002    -0.093    -0.095    -0.091    
        (0.064)   (0.057)   (0.057)   (0.055)           (0.055)   (0.055)   (0.053)   (0.051)   (0.051)   (0.050)   
room_typePrivate room        -0.562 ***-0.388 ***-0.387 ***-0.350 ***        -0.371 ***-0.381 ***-0.371 ***-0.431 ***-0.431 ***-0.433 ***
        (0.024)   (0.021)   (0.021)   (0.012)           (0.012)   (0.012)   (0.012)   (0.011)   (0.011)   (0.011)   
room_typeShared room        -0.610 ***-0.435 ***-0.431 ***-0.296 **         -0.285 ** -0.281 ** -0.257 ** -0.298 ***-0.298 ***-0.282 ***
        (0.106)   (0.094)   (0.094)   (0.094)           (0.094)   (0.094)   (0.089)   (0.085)   (0.085)   (0.085)   
beds                0.009                                                                            
                (0.006)                                                                           
bedrooms                0.123 ***0.129 ***0.119 ***0.141 ***0.122 ***0.124 ***0.133 ***0.145 ***0.145 ***0.143 ***
                (0.009)   (0.009)   (0.008)   (0.009)   (0.008)   (0.008)   (0.008)   (0.007)   (0.007)   (0.007)   
accommodates                0.117 ***0.121 ***0.127 ***0.127 ***0.124 ***0.122 ***0.119 ***0.106 ***0.106 ***0.106 ***
                (0.006)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   
host_is_superhostTRUE                                                0.124 ***0.118 ***0.110 ***0.089 ***0.088 ***0.067 ***
                                                (0.013)   (0.013)   (0.013)   (0.012)   (0.012)   (0.012)   
instant_bookableTRUE                                                        0.055 ***0.056 ***0.037 ***0.037 ***0.021 *  
                                                        (0.011)   (0.010)   (0.010)   (0.010)   (0.010)   
neighbourhood_cleansedTop 11-15                                                                0.086 ***0.099 ***0.099 ***0.099 ***
                                                                (0.019)   (0.019)   (0.019)   (0.018)   
neighbourhood_cleansedTop 5                                                                0.366 ***0.371 ***0.371 ***0.369 ***
                                                                (0.017)   (0.017)   (0.017)   (0.017)   
neighbourhood_cleansedTop 6-10                                                                0.227 ***0.245 ***0.245 ***0.245 ***
                                                                (0.018)   (0.017)   (0.017)   (0.017)   
availability_30                                                                        0.014 ***0.014 ***0.013 ***
                                                                        (0.000)   (0.001)   (0.001)   
licenseyes                                                                                0.004            
                                                                                (0.010)           
last_review                                                                                        0.000 ***
                                                                                        (0.000)   
N8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        
R20.171    0.223    0.391    0.391    0.377    0.366    0.383    0.385    0.441    0.489    0.489    0.494    
logLik-5123.513    -4850.132    -3829.296    -3830.537    -3930.350    -3997.392    -3887.254    -3874.003    -3476.954    -3102.201    -3102.097    -3054.720    
AIC10263.026    9722.264    7686.591    7687.074    7878.700    8014.784    7794.508    7770.007    6981.909    6234.401    6236.195    6141.439    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Our final model has an r-squared of 0.494 and excludes the license categorical variable (Y/N), beds (numerical), and property type.

3.3.4 Overfitting test

We now test our model for overfitting

#To get the RMSE from the training set
rmse_train <- listings_train %>% 
  
  #Calculating RMSE
  mutate(predictions = predict(model_final, .)) %>% 
  summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>% 
  
  #Printing RMSE
  pull()
rmse_train
[1] 1.236614e-11
#Repeated for test set
rmse_test <- listings_test %>% 
  
  #Calcualating RMSE
    mutate(predictions = predict(model_final, .)) %>% 
  summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>% 
  
  #Printing RMSE
  pull()
rmse_test
[1] 0.3799613

There seems to be some overfitting in our model. Let’s see if we can adjust the final model by taking out some of the less key variables

#Model final adjusted without instant_bookable and without last review:
model_final_2 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    neighbourhood_cleansed+ 
    availability_30, 

  #Dataset
  data = listings_train)

summary(model_final_2)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    neighbourhood_cleansed + availability_30, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5161 -0.2156 -0.0072  0.1992  4.1598 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.054e+00  6.040e-02  83.669  < 2e-16 ***
number_of_reviews               -6.726e-04  7.038e-05  -9.557  < 2e-16 ***
review_scores_rating             1.063e-01  1.186e-02   8.961  < 2e-16 ***
room_typeHotel room             -7.094e-02  5.043e-02  -1.407 0.159563    
room_typePrivate room           -4.244e-01  1.110e-02 -38.235  < 2e-16 ***
room_typeShared room            -3.011e-01  8.544e-02  -3.525 0.000427 ***
bedrooms                         1.437e-01  7.470e-03  19.236  < 2e-16 ***
accommodates                     1.072e-01  4.827e-03  22.203  < 2e-16 ***
host_is_superhostTRUE            9.250e-02  1.217e-02   7.599 3.31e-14 ***
neighbourhood_cleansedTop 11-15  9.865e-02  1.853e-02   5.323 1.05e-07 ***
neighbourhood_cleansedTop 5      3.709e-01  1.664e-02  22.287  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.436e-01  1.682e-02  14.480  < 2e-16 ***
availability_30                  1.397e-02  4.942e-04  28.275  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3512 on 8350 degrees of freedom
Multiple R-squared:  0.4877,    Adjusted R-squared:  0.4869 
F-statistic: 662.4 on 12 and 8350 DF,  p-value: < 2.2e-16

We still keep quite a high explanatory power of 0.487% so let’s see again the RMSE

#To get the RMSE from the training set
rmse_train <- listings_train %>% 
  
  #Calculating RMSE
  mutate(predictions = predict(model_final_2, .)) %>% 
  summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>% 
  
  #Printing RMSE
  pull()
rmse_train
[1] 1.201184e-11
#Repeated for test set
rmse_test <- listings_test %>% 
  
  #Calcualating RMSE
    mutate(predictions = predict(model_final_2, .)) %>% 
  summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>% 
  
  #Printing RMSE
  pull()
rmse_test
[1] 0.3224991

We managed to lower RMSE by 0.05 by removing these variables while still keeping the RMSE low in our model.

3.4 Final Model

First let us summarize the final model:

summary(model_final_2)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    neighbourhood_cleansed + availability_30, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5161 -0.2156 -0.0072  0.1992  4.1598 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.054e+00  6.040e-02  83.669  < 2e-16 ***
number_of_reviews               -6.726e-04  7.038e-05  -9.557  < 2e-16 ***
review_scores_rating             1.063e-01  1.186e-02   8.961  < 2e-16 ***
room_typeHotel room             -7.094e-02  5.043e-02  -1.407 0.159563    
room_typePrivate room           -4.244e-01  1.110e-02 -38.235  < 2e-16 ***
room_typeShared room            -3.011e-01  8.544e-02  -3.525 0.000427 ***
bedrooms                         1.437e-01  7.470e-03  19.236  < 2e-16 ***
accommodates                     1.072e-01  4.827e-03  22.203  < 2e-16 ***
host_is_superhostTRUE            9.250e-02  1.217e-02   7.599 3.31e-14 ***
neighbourhood_cleansedTop 11-15  9.865e-02  1.853e-02   5.323 1.05e-07 ***
neighbourhood_cleansedTop 5      3.709e-01  1.664e-02  22.287  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.436e-01  1.682e-02  14.480  < 2e-16 ***
availability_30                  1.397e-02  4.942e-04  28.275  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3512 on 8350 degrees of freedom
Multiple R-squared:  0.4877,    Adjusted R-squared:  0.4869 
F-statistic: 662.4 on 12 and 8350 DF,  p-value: < 2.2e-16

Let’s check for issues with residuals.

autoplot(model_final_2)

Again we see there are no issues with residuals or with normal distribution.

vif(model_final_2)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.250398  1        1.118212
review_scores_rating   1.052602  1        1.025964
room_type              1.376315  3        1.054677
bedrooms               2.226404  1        1.492114
accommodates           2.195945  1        1.481872
host_is_superhost      1.217543  1        1.103423
neighbourhood_cleansed 1.044653  3        1.007307
availability_30        1.136085  1        1.065873

Also, again we still see no issues.

Lets plot our models

huxreg(list(
  "Prop Type, Reviews, Rating" = model1, 
  "+ Room Type" = model2, 
  "+ Beds + Bedrooms + Accom." = model3_1, 
  "- Beds" = model3_2, 
  "- Prop Type" = model3_3, 
  " - Room Type + Prop Type"  = model3_4, 
  "+ Superhost" = model4, 
  "+ immidiate bookings" = model5, 
  "+ Neighbourhoods" = model6, 
  "+ Availability" = model7, 
  "+ license" = model8, 
  "FINAL - instant bookable - last_review" = model_final_2))
Prop Type, Reviews, Rating+ Room Type+ Beds + Bedrooms + Accom.- Beds- Prop Type - Room Type + Prop Type+ Superhost+ immidiate bookings+ Neighbourhoods+ Availability+ licenseFINAL - instant bookable - last_review
(Intercept)5.836 ***5.970 ***5.308 ***5.302 ***5.284 ***5.174 ***5.369 ***5.334 ***5.048 ***5.029 ***5.031 ***5.054 ***
(0.071)   (0.069)   (0.063)   (0.063)   (0.063)   (0.064)   (0.064)   (0.064)   (0.063)   (0.061)   (0.061)   (0.060)   
prop_type_simplifiedEntire residential home0.225 ***0.227 ***-0.074 ***-0.072 ***        -0.093 ***                                                
(0.021)   (0.020)   (0.019)   (0.019)           (0.019)                                                   
prop_type_simplifiedEntire townhouse0.292 ***0.295 ***0.006    0.008            -0.014                                                    
(0.031)   (0.030)   (0.028)   (0.028)           (0.028)                                                   
prop_type_simplifiedOther-0.070 ***0.239 ***0.148 ***0.149 ***        -0.060 ***                                                
(0.014)   (0.019)   (0.017)   (0.017)           (0.012)                                                   
prop_type_simplifiedPrivate room in rental unit-0.532 ***0.012    -0.037    -0.036            -0.404 ***                                                
(0.017)   (0.028)   (0.025)   (0.025)           (0.015)                                                   
number_of_reviews-0.001 ***-0.000 *  -0.000 ** -0.000 ** -0.000 *  -0.000 ***-0.000 ***-0.000 ***-0.000 ***-0.001 ***-0.001 ***-0.001 ***
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
review_scores_rating0.111 ***0.082 ***0.111 ***0.111 ***0.116 ***0.131 ***0.097 ***0.103 ***0.110 ***0.110 ***0.110 ***0.106 ***
(0.015)   (0.014)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.012)   (0.012)   (0.012)   (0.012)   
room_typeHotel room        -0.258 ***-0.072    -0.072    0.063            0.017    -0.018    -0.002    -0.093    -0.095    -0.071    
        (0.064)   (0.057)   (0.057)   (0.055)           (0.055)   (0.055)   (0.053)   (0.051)   (0.051)   (0.050)   
room_typePrivate room        -0.562 ***-0.388 ***-0.387 ***-0.350 ***        -0.371 ***-0.381 ***-0.371 ***-0.431 ***-0.431 ***-0.424 ***
        (0.024)   (0.021)   (0.021)   (0.012)           (0.012)   (0.012)   (0.012)   (0.011)   (0.011)   (0.011)   
room_typeShared room        -0.610 ***-0.435 ***-0.431 ***-0.296 **         -0.285 ** -0.281 ** -0.257 ** -0.298 ***-0.298 ***-0.301 ***
        (0.106)   (0.094)   (0.094)   (0.094)           (0.094)   (0.094)   (0.089)   (0.085)   (0.085)   (0.085)   
beds                0.009                                                                            
                (0.006)                                                                           
bedrooms                0.123 ***0.129 ***0.119 ***0.141 ***0.122 ***0.124 ***0.133 ***0.145 ***0.145 ***0.144 ***
                (0.009)   (0.009)   (0.008)   (0.009)   (0.008)   (0.008)   (0.008)   (0.007)   (0.007)   (0.007)   
accommodates                0.117 ***0.121 ***0.127 ***0.127 ***0.124 ***0.122 ***0.119 ***0.106 ***0.106 ***0.107 ***
                (0.006)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   
host_is_superhostTRUE                                                0.124 ***0.118 ***0.110 ***0.089 ***0.088 ***0.093 ***
                                                (0.013)   (0.013)   (0.013)   (0.012)   (0.012)   (0.012)   
instant_bookableTRUE                                                        0.055 ***0.056 ***0.037 ***0.037 ***        
                                                        (0.011)   (0.010)   (0.010)   (0.010)           
neighbourhood_cleansedTop 11-15                                                                0.086 ***0.099 ***0.099 ***0.099 ***
                                                                (0.019)   (0.019)   (0.019)   (0.019)   
neighbourhood_cleansedTop 5                                                                0.366 ***0.371 ***0.371 ***0.371 ***
                                                                (0.017)   (0.017)   (0.017)   (0.017)   
neighbourhood_cleansedTop 6-10                                                                0.227 ***0.245 ***0.245 ***0.244 ***
                                                                (0.018)   (0.017)   (0.017)   (0.017)   
availability_30                                                                        0.014 ***0.014 ***0.014 ***
                                                                        (0.000)   (0.001)   (0.000)   
licenseyes                                                                                0.004            
                                                                                (0.010)           
N8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        
R20.171    0.223    0.391    0.391    0.377    0.366    0.383    0.385    0.441    0.489    0.489    0.488    
logLik-5123.513    -4850.132    -3829.296    -3830.537    -3930.350    -3997.392    -3887.254    -3874.003    -3476.954    -3102.201    -3102.097    -3109.400    
AIC10263.026    9722.264    7686.591    7687.074    7878.700    8014.784    7794.508    7770.007    6981.909    6234.401    6236.195    6246.800    
*** p < 0.001; ** p < 0.01; * p < 0.05.

We see now we have significant variables and rhe model 2nd model is almost equivilantly good at predicting prices.

Interestingly, our model only predicts roughly 50% of price variations. The remaining 50% is explained by variables not included here.

50% does seem high given the final model is only using 8 different variables with some of the being categorical.

3.5 Model test on Airbnb’s

We will now try to use our model to predict prices for staying for 4 nights, in a private room, with at least 10 reviews, and reviews above 4.5

First let us define our target segment

#Utilizing the full dataset
targets <- listings_4 %>%
  
  #BY Private Room
  filter(room_type == "Private room", 
         
         #At least 10 reviews
         number_of_reviews >= 10, 
         
         #A score above 4.5
         review_scores_rating >= 4.5) %>%
  
  #Predicted values in log
  mutate(log_predicted_values = predict(model_final, .),
         
         #Predicted Values Nominal
         predicted_price = exp(log_predicted_values)) 


#Summarizing statistics
targets_test <- targets %>% 
  
  summarise(
    average_price=mean(predicted_price), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
            
    sd_price=sd(predicted_price), #Standard Deviation
            
    count= n(), #Observations
           
    t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
            
    se_price=sd_price/sqrt(count), #Standard Error 
           
    margin_of_error= t_critical*se_price, #Margin of Error
            
    price_low= average_price - margin_of_error, #Lower interval
            
    price_high= average_price + margin_of_error) #Upper Interval 


targets_test
average_pricesd_pricecountt_criticalse_pricemargin_of_errorprice_lowprice_high
38110413571.962.835.55376387
skim(targets$predicted_price)
Data summary
Name targets$predicted_price
Number of rows 1357
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 381.26 104.13 186.4 310.32 359.16 434.95 903.71 ▅▇▂▁▁

The point prediction is the mean equivalent of a price for 2 people for 4 nights of 381.25

The confindence interval for the mean payment of 2 people spending 4 nights is 376-387 $.

4 Acknowledgements